Gemma 4 Specs: Complete Guide to Model Sizes, Architecture, Context, Modalities, Benchmarks, and Deployment Fit
A long-form breakdown of the official Gemma 4 specification set, including dense and MoE variants, context windows, supported modalities, architecture details, benchmark highlights, and what the specs mean in practice.
Why a specs page matters: When people search for “Gemma 4 specs,” they usually want more than a list of parameter counts. They want to know what the family actually includes, how the models differ, which variants are multimodal, what context windows are supported, how the dense and MoE designs compare, and what all of those technical choices mean for real deployment.

The official Google AI for Developers documentation now gives a much clearer view of Gemma 4 than earlier public summaries did. It explains that the family spans small effective-parameter models, a large dense model, and an efficient Mixture-of-Experts model, while also documenting context lengths, supported modalities, benchmark categories, and even specific layer counts and sliding-window sizes. This page turns those official details into a structured, readable reference you can actually use when evaluating the family for projects, content, or product planning.
What Is Gemma 4?
Gemma 4 is the current core Gemma model family from Google DeepMind. Google’s official Gemma overview describes it as a family of open models meant for a wide range of generation tasks, including question answering, summarization, reasoning, coding, and multimodal understanding. The release page documents that Gemma 4 launched on March 31, 2026, in four sizes: E2B, E4B, 31B, and 26B A4B. That release framing is important because it shows Google is not treating Gemma 4 as one single checkpoint, but as a structured family with distinct deployment targets ranging from edge-friendly scenarios to workstations and more powerful systems.
The model overview also says Gemma 4 spans three architecture categories. The first is small effective-parameter models, built for ultra-mobile, edge, and browser deployment. The second is a 31B dense model for high-end performance that still aims to stay reachable for local execution on stronger hardware. The third is a 26B MoE variant designed to combine strong reasoning performance with high throughput and efficient inference characteristics. Right away, that tells you Gemma 4 specs are not just about “bigger is better.” They are about different design trade-offs.
Gemma 4 is a four-model family with two small effective-parameter models, one large dense model, and one large MoE model, all positioned around practical deployment rather than a single one-size-fits-all design.
Gemma 4 Model Sizes and Core Specs
The most useful starting point for Gemma 4 specs is the official models overview table in the model card. That table clarifies the four variants and shows how Google wants the family to be understood: E2B and E4B as compact efficient models, 31B as a dense flagship-style model, and 26B A4B as the MoE option. The small models are especially interesting because Google does not just list raw parameter counts. It lists effective parameters and then also notes larger embedding-inclusive totals, which is a clue that these models use a different efficiency strategy than a conventional dense model.
| Model | Architecture | Core parameter figure | Layers | Context | Modalities |
|---|---|---|---|---|---|
| Gemma 4 E2B | Dense, effective-parameter small model | 2.3B effective, 5.1B with embeddings | 35 | 128K | Text, Image, Audio |
| Gemma 4 E4B | Dense, effective-parameter small model | 4.5B effective, 8B with embeddings | 42 | 128K | Text, Image, Audio |
| Gemma 4 31B | Dense | 30.7B | 60 | 256K | Text, Image |
| Gemma 4 26B A4B | MoE | 25.2B total, 3.8B active | 30 | 256K | Text, Image |
These numbers already tell an important story. The E2B and E4B models are not just smaller versions of the 31B. They are intentionally optimized for efficient local execution. The 26B A4B model is also not a normal “26B dense” model. Its MoE setup means its active parameter count during inference is far smaller than its total parameter count. That is one of the most important Gemma 4 specs to understand because it has direct implications for speed, throughput, and deployment economics.
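The dense-versus-MoE trade-off can be put into rough numbers. This back-of-envelope sketch uses only the parameter figures from the table above; real throughput also depends on routing overhead, memory bandwidth, and batching, so treat the ratios as intuition rather than measured performance.

```python
# Back-of-envelope comparison using the parameter figures from the spec
# table above. These ratios are rough intuition, not measured performance.

DENSE_31B_PARAMS = 30.7e9   # every parameter touches every token
MOE_TOTAL_PARAMS = 25.2e9   # weights that must be held in memory
MOE_ACTIVE_PARAMS = 3.8e9   # weights applied per token at inference

# Per-token compute scales with *active* parameters...
compute_ratio = DENSE_31B_PARAMS / MOE_ACTIVE_PARAMS

# ...but weight memory scales with *total* parameters.
memory_ratio = DENSE_31B_PARAMS / MOE_TOTAL_PARAMS

print(f"Per-token compute advantage of 26B A4B: ~{compute_ratio:.1f}x")
print(f"Weight-memory ratio (31B vs 26B A4B):   ~{memory_ratio:.2f}x")
```

The takeaway: per token, the MoE model applies roughly one eighth of the weights the dense 31B does, yet it still has to keep nearly the same volume of weights resident in memory.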
Architecture: Dense, Effective-Parameter, and MoE Designs
Google’s documentation makes clear that Gemma 4 does not follow a single architectural pattern across the whole family. Instead, it uses different approaches depending on the deployment target. The small models use what Google calls effective parameters, while the 31B is a straightforward dense model and the 26B A4B is a Mixture-of-Experts model. This matters because “parameter count” alone is no longer enough to explain how a Gemma 4 model behaves or why it might be a good fit for a specific use case.
For E2B and E4B, Google explains that the “E” stands for effective parameters. The model card says these smaller models incorporate Per-Layer Embeddings, or PLE, to maximize parameter efficiency in on-device deployments. Each decoder layer gets its own small embedding for every token. Those embedding tables are large, but because they are consulted through cheap per-token lookups rather than dense matrix computation, the effective parameter count stays well below the embedding-inclusive total. This is one of the most interesting parts of the Gemma 4 spec sheet because it shows Google is optimizing aggressively for edge and local efficiency rather than just shrinking a standard dense architecture.
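As a rough illustration of the PLE idea, here is a toy sketch. The table sizes and dimensions below are invented for illustration; only the concept, large per-layer embedding tables consulted by cheap lookup rather than by dense compute, mirrors the description above.

```python
# Toy illustration of Per-Layer Embeddings (PLE): each decoder layer has
# its own small per-token embedding table that is read by a cheap gather.
# All shapes here are made up; only the lookup-vs-compute concept is real.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, N_LAYERS, D_PLE = 1000, 4, 8   # tiny illustrative sizes

# One small embedding table per layer: lookups, not matrix multiplies.
ple_tables = rng.normal(size=(N_LAYERS, VOCAB, D_PLE))

def ple_lookup(layer: int, token_ids: np.ndarray) -> np.ndarray:
    """Fetch this layer's per-token embedding rows (a cheap gather)."""
    return ple_tables[layer, token_ids]

tokens = np.array([3, 17, 256])
per_layer_extra = [ple_lookup(i, tokens) for i in range(N_LAYERS)]

# The tables contribute many parameters, but none of them participate
# in dense matrix compute -- which is why they can be excluded from an
# "effective" parameter count.
print(f"PLE parameters in this toy: {ple_tables.size}")          # 32000
print(f"per-layer lookup shape: {per_layer_extra[0].shape}")     # (3, 8)
```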
The 26B A4B model takes a different route. It is a Mixture-of-Experts design with 25.2B total parameters but only 3.8B active parameters during inference. The model card also notes an expert count of 8 active out of 128 total plus 1 shared expert. Google says this lets the 26B A4B run much faster than its total parameter count might suggest and makes it a strong choice for fast inference relative to the dense 31B. That one sentence alone explains why a lot of people are especially interested in the 26B A4B specs: it aims to offer a strong capability-to-speed ratio rather than only raw size.
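A minimal sketch of that routing pattern, assuming a standard top-k softmax router. The spec gives only the 8-active / 128-total / 1-shared layout; the router mechanics, dimensions, and toy "experts" below are illustrative, not the actual implementation.

```python
# Toy sketch of MoE routing: per token, a router scores all experts, the
# top-k are activated, and one shared expert is always applied. The
# 8-active / 128-total / 1-shared layout comes from the spec; everything
# else (dimensions, router weights, expert internals) is illustrative.
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, D_MODEL = 128, 8, 64   # expert counts per the spec

router_w = rng.normal(size=(D_MODEL, N_EXPERTS))  # router projection
token = rng.normal(size=D_MODEL)                  # one token's hidden state

logits = token @ router_w
top_idx = np.argsort(logits)[-TOP_K:]             # the 8 chosen experts

# Softmax over only the chosen experts so their mixing weights sum to 1.
chosen = np.exp(logits[top_idx] - logits[top_idx].max())
weights = chosen / chosen.sum()

# Stand-in experts; a real expert is a full feed-forward block.
expert_out = {int(i): np.tanh(token * (int(i) + 1) / N_EXPERTS) for i in top_idx}
shared_out = np.tanh(token)                       # the always-on shared expert

mixed = sum(w * expert_out[int(i)] for w, i in zip(weights, top_idx)) + shared_out

print(f"activated experts: {sorted(expert_out)}")
print(f"routing weights sum: {weights.sum():.3f}")
```

Only the selected experts' weights are touched for this token, which is exactly why the per-token cost tracks the 3.8B active figure rather than the 25.2B total.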
🧩 E-model efficiency
E2B and E4B use Per-Layer Embeddings to keep effective parameter counts smaller than the total embedding-inclusive size, which helps for on-device and local deployment.
🏗️ Dense flagship
The 31B dense model is the straightforward high-capability dense option in the family, with 60 layers and a 256K context window.
⚡ MoE speed strategy
The 26B A4B MoE model uses 3.8B active parameters during inference, which is the key reason it can behave much faster than its total size implies.
Context Window, Sliding Window, and Long-Context Design
One of the most important Gemma 4 specs is the context window. Google’s overview and model card both emphasize that Gemma 4 supports long contexts: 128K for the small E2B and E4B models, and 256K for the larger 26B A4B and 31B models. That is a major part of how Gemma 4 is positioned. It is meant to handle long documents, extended conversations, complex retrieved context, and more ambitious agent workflows than older, smaller local-friendly models typically could.
The architectural details go further. Google says the models use a hybrid attention mechanism that interleaves local sliding-window attention with full global attention, with the final layer always global. The design is explicitly described as a way to keep the processing speed and low memory footprint of a lightweight model while preserving the deeper awareness needed for complex long-context tasks. The documentation also mentions unified Keys and Values in global layers and Proportional RoPE, or p-RoPE, as part of memory optimization for long contexts. These are not just abstract engineering notes. They explain why Gemma 4 can target long context without giving up the deployment efficiency story that Google keeps emphasizing.
Sliding-window sizes differ too. E2B and E4B use 512-token sliding windows, while the 31B dense and 26B A4B MoE models use 1024-token sliding windows. That is another subtle but useful Gemma 4 spec, because it shows the larger models are tuned for broader local context within their hybrid attention flow.
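The hybrid layout can be sketched as follows. The window sizes and the always-global final layer come from the description above; the 5-local-to-1-global interleave ratio is an assumption for illustration, since the exact ratio is not documented here.

```python
# Sketch of the hybrid attention layout: local sliding-window layers
# interleaved with full global layers, final layer always global. The
# 5:1 interleave ratio is an assumption; the spec only documents the
# window sizes (512 for E2B/E4B, 1024 for 31B and 26B A4B) and the
# always-global final layer.

def causal_mask(seq_len: int, window: "int | None") -> list:
    """True where query q may attend to key k.

    window=None means full global causal attention; otherwise each
    query only sees the last `window` positions (sliding window).
    """
    return [
        [k <= q and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]

def layer_schedule(n_layers: int, locals_per_global: int = 5) -> list:
    """Interleave local and global layers, forcing the last layer global."""
    kinds = [
        "global" if (i + 1) % (locals_per_global + 1) == 0 else "local"
        for i in range(n_layers)
    ]
    kinds[-1] = "global"  # the final layer is always global per the docs
    return kinds

# Tiny demonstration: 8 positions, window of 4.
mask = causal_mask(8, window=4)
assert mask[7][3] is False   # position 7 cannot see position 3 (outside window)
assert mask[7][4] is True    # ...but can see position 4

print(layer_schedule(12))
```

The point of the sketch is the memory story: local layers only ever score a fixed-width band of keys, so only the occasional global layers pay full long-context attention cost.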
A 128K or 256K context window is a capability limit, not a guarantee that every workflow should stuff that much text into a prompt. Good retrieval and good prompt structure still matter.
Modalities, Languages, and Core Capabilities
Gemma 4 is explicitly multimodal. The model card states that all Gemma 4 models handle text and image input and generate text output. It also says audio is supported on the smaller models. The dense-model table then shows E2B and E4B supporting Text, Image, and Audio, while the 31B supports Text and Image. The MoE 26B A4B model is also listed as Text and Image. Google’s overview page adds that video is featured natively on the E2B and E4B models, alongside audio. Together, that means the family is broader than a simple text-only LLM lineup.
The model card also lists several core capabilities: thinking mode, long context, image understanding, video understanding, interleaved multimodal input, function calling, coding, multilingual support, and audio tasks on the smaller models. On language coverage, Google says Gemma 4 is pre-trained on over 140 languages and offers out-of-the-box support for more than 35 languages. That is a useful distinction. The model has been exposed to much broader multilingual data, but the strongest out-of-the-box performance claim is framed more narrowly.
| Spec area | Official Gemma 4 detail | What it means |
|---|---|---|
| Text and image | Supported across all Gemma 4 models | The whole family is multimodal at a baseline level. |
| Audio | Native on E2B and E4B | The smaller models are not stripped-down toys; they include important media capabilities. |
| Video | Documented as featured natively on E2B and E4B | Video understanding is part of the small-model story, not only the high-end story. |
| Multilingual | 35+ languages out of the box, pre-trained on 140+ languages | Useful for international workflows and multilingual products. |
| Function calling | Native support | Gemma 4 is designed with agentic and structured tool-use scenarios in mind. |
Model-By-Model Spec Breakdown
Gemma 4 E2B is the smallest model in the family, with 2.3B effective parameters and 5.1B parameters including embeddings. It has 35 layers, a 512-token sliding window, a 128K context length, a 262K vocabulary size, text-image-audio support, an approximately 150M parameter vision encoder, and an approximately 300M parameter audio encoder. This is clearly the most edge-leaning core Gemma 4 model in the documented lineup.
Gemma 4 E4B scales that formula up to 4.5B effective parameters and 8B with embeddings, 42 layers, the same 512-token sliding window, the same 128K context window, the same 262K vocabulary, and the same text-image-audio modality mix. It also keeps the approximate 150M vision encoder and 300M audio encoder. In many ways, this looks like the practical “balanced small model” in the family.
Gemma 4 31B is the dense flagship. It has 30.7B parameters, 60 layers, a 1024-token sliding window, a 256K context window, a 262K vocabulary, and text-image support. Its vision encoder is listed at approximately 550M parameters, and there is no audio encoder. This is the model you look at when you want the full dense-model specification story.
Gemma 4 26B A4B is the MoE alternative. It has 25.2B total parameters, 3.8B active parameters, 30 layers, a 1024-token sliding window, a 256K context window, a 262K vocabulary, an expert layout of 8 active / 128 total plus 1 shared, and text-image support with an approximately 550M vision encoder. In practice, this is the most spec-interesting model in the lineup because its performance and efficiency story depends so heavily on the difference between total and active parameters.
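For quick side-by-side comparison, the figures from this breakdown can be collected into one structure. The numbers are the ones stated above; the field names are convenient labels for this article, not official identifiers.

```python
# The per-model figures from the breakdown above, gathered into one
# structure. Field names are informal labels, not official identifiers.
from dataclasses import dataclass

@dataclass(frozen=True)
class GemmaSpec:
    params_b: float        # effective (E-models) or total parameters, billions
    active_b: float        # parameters applied per token, billions
    layers: int
    sliding_window: int    # tokens
    context_k: int         # context window, thousands of tokens
    modalities: tuple

SPECS = {
    "E2B":     GemmaSpec(2.3,  2.3,  35,  512, 128, ("text", "image", "audio")),
    "E4B":     GemmaSpec(4.5,  4.5,  42,  512, 128, ("text", "image", "audio")),
    "31B":     GemmaSpec(30.7, 30.7, 60, 1024, 256, ("text", "image")),
    "26B A4B": GemmaSpec(25.2, 3.8,  30, 1024, 256, ("text", "image")),
}

# Example query: which models apply at most ~5B parameters per token?
fast_models = [name for name, s in SPECS.items() if s.active_b <= 5.0]
print(fast_models)  # the two E-models plus the MoE variant
```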
Memory, Quantization, and Deployment Fit
Google’s model overview page explicitly notes that Gemma 4 models are available in 4 parameter sizes and can be used at default precision or lower precision with quantization. The documentation positions these precision choices as trade-offs between capability, memory cost, power consumption, and runtime efficiency. That makes the spec story more practical, because parameter count alone never tells the full deployment story. The same model can behave very differently in cost and feasibility depending on whether you run it at 16-bit or with aggressive quantization.
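To make the precision trade-off concrete, here is a rough weight-memory estimate for a few of the headline parameter counts. This ignores the KV cache, activations, and runtime overhead, so real requirements are higher; it is only meant to show why precision choice changes deployment fit.

```python
# Rough weight-memory estimates at different precisions, using the
# headline parameter counts from this page. Ignores KV cache,
# activations, and runtime overhead, so actual needs are higher.

GIB = 1024 ** 3

def weight_memory_gib(params: float, bits: int) -> float:
    """Approximate weight storage for `params` parameters at `bits` bits each."""
    return params * bits / 8 / GIB

for name, params in [("E4B (8B w/ embeddings)", 8e9),
                     ("31B", 30.7e9),
                     ("26B A4B (total)", 25.2e9)]:
    bf16 = weight_memory_gib(params, 16)
    q4 = weight_memory_gib(params, 4)
    print(f"{name:24s} 16-bit ~{bf16:5.1f} GiB   4-bit ~{q4:5.1f} GiB")
```

Even this crude arithmetic shows why quantization matters: the same 31B checkpoint moves from clearly multi-GPU territory at 16-bit toward single-accelerator or high-end-workstation range at 4-bit.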
The overview page also gives a deployment framing that is worth repeating: the small E2B and E4B models are built for ultra-mobile, edge, and browser deployment, the 31B is the high-end dense option, and the 26B MoE is tuned for efficient advanced reasoning. That means the “right” Gemma 4 spec is not always the biggest one. Sometimes the most relevant spec is the one that fits your hardware, latency target, and product surface without becoming operationally painful.
E2B
Best understood as the smallest, most edge-oriented Gemma 4 spec set, suitable for highly constrained deployments.
E4B
A balanced small-model spec that still keeps multimodal and long-context functionality in reach.
26B A4B
The most deployment-interesting large model if you care about throughput and MoE efficiency.
31B
The clearest dense high-end option when you want maximum capability inside the official family without using the MoE route.
Benchmark Highlights
The model card includes a benchmark table that spans reasoning, coding, vision, audio, and long-context measures. Some of the headline numbers are especially useful if you want a quick picture of the hierarchy inside the family. On MMLU Pro, the 31B scores 85.2% and the 26B A4B scores 82.6%, well ahead of E4B at 69.4% and E2B at 60.0%. On AIME 2026 without tools, the 31B and 26B A4B are both close to 89%, while the small models trail far behind. On LiveCodeBench v6, the 31B leads at 80.0%, followed by the 26B A4B at 77.1%. These are the kinds of numbers that explain why the larger models are being positioned as serious reasoning and coding options rather than only academic curiosities.
The vision side is strong too. On MMMU Pro, the 31B posts 76.9% and the 26B A4B 73.8%, with E4B at 52.6% and E2B at 44.2%. On MATH-Vision, the 31B reaches 85.6% and the 26B A4B 82.4%. Audio results are only reported for the small audio-capable models, which matches the modality tables. And on long-context MRCR v2 8-needle 128K, the 31B leads at 66.4% with the 26B A4B at 44.1%, again showing a substantial step up over the smaller models.
| Benchmark | 31B | 26B A4B | E4B | E2B |
|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026 no tools | 89.2% | 88.3% | 42.5% | 37.5% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% |
| MMMU Pro | 76.9% | 73.8% | 52.6% | 44.2% |
| MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4% |
| MRCR v2 8-needle 128K | 66.4% | 44.1% | 25.4% | 19.1% |
These results do not tell you everything about product fit, but they do help explain the overall spec landscape. The small models are practical and multimodal, while the 31B and 26B A4B are where most of the flagship-style reasoning and coding strength lives.
License, Variants, and Ecosystem Fit
The official model card states that Gemma 4 is licensed under Apache 2.0 and authored by Google DeepMind. It also notes that the release includes open-weight models in both pre-trained and instruction-tuned variants. That combination matters a lot for developers and content creators. Apache 2.0 is a strong signal for broad usable openness, while the availability of both pre-trained and instruction-tuned variants means different workflows are supported from the start, from raw tuning and experimentation to direct application building.
Google’s docs also point to multiple ecosystem routes: Kaggle, Hugging Face, Transformers, Keras, PyTorch, Ollama, LM Studio, Google AI Edge, and the Gemini API for hosted access. This is important because a good specs page should not stop at raw architecture. It should also explain where the models fit. In Gemma 4’s case, the official ecosystem story is one of unusually broad deployment flexibility.
What the Specs Mean in Practice
If you are reading Gemma 4 specs as a product builder, the main decision is not “Which number is biggest?” It is “Which design best matches my workload?” E2B and E4B are small, long-context, multimodal, and designed for efficient execution. They are the most practical if you care about edge, mobile, or lightweight local deployments. The 31B is the dense flagship when you want raw capability inside a conventional dense design. The 26B A4B is the clever large-model option when you want strong reasoning and good throughput without paying the full dense-model cost at every inference step.
In other words, the Gemma 4 spec sheet is best read as a menu of deployment philosophies. The family is not trying to force everyone toward one huge universal model. Instead, it gives you several different shapes of capability and efficiency so you can pick the one that actually fits your use case.
Gemma 4 Specs FAQ
- When was Gemma 4 released? Google’s release page lists Gemma 4 as released on March 31, 2026.
- What models are in the Gemma 4 family? E2B, E4B, 31B, and 26B A4B.
- What is the difference between E2B and E4B? Both are effective-parameter small models with 128K context and audio support, but E4B is larger at 4.5B effective parameters versus 2.3B for E2B.
- Is Gemma 4 multimodal? Yes. All Gemma 4 models support text and image input, while E2B and E4B also support audio; Google also documents video capability on the smaller models.
- What is the Gemma 4 context window? 128K on E2B and E4B, and 256K on 26B A4B and 31B.
- What does 26B A4B mean? It is the MoE model with 25.2B total parameters and 3.8B active parameters during inference.
- What is the license? The official model card lists Apache 2.0.
- How many languages does Gemma 4 support? Google says Gemma 4 is pre-trained on 140+ languages with out-of-the-box support for 35+ languages.
Gemma 4 specs are really about four different deployment shapes: two efficient small multimodal models, one large dense flagship, and one large MoE model optimized for faster high-level inference.
Final Take
Gemma 4 has one of the more interesting spec sheets among current open model families because it does not rely on a single scaling story. Instead, Google mixes effective-parameter efficiency, dense high-end capability, and MoE throughput efficiency inside one family. The result is a lineup that can target edge hardware, laptops, workstations, and more serious inference environments without pretending those are all the same deployment problem.
That is why the details matter. The layer counts, sliding windows, context lengths, modality support, active parameter counts, and benchmark categories are not just technical trivia. They explain what each Gemma 4 model is trying to be. And once you read the specs that way, the family becomes much easier to understand: E2B and E4B for efficient local multimodal use, 31B for dense flagship performance, and 26B A4B for a faster large-model MoE path.
⚠️ Specs Note
This page is an informational summary based on current official Google AI for Developers documentation. Exact deployment behavior, memory needs, and supported runtime features can vary by framework, precision, and environment, so always confirm implementation details against the latest official docs.