One Month Later, Everything Changed
In early March, we published a head-to-head comparison of Llama 4, Qwen 3.5, and Gemma 3. The conclusion was clear: Gemma 3 finished last in every category except raw inference speed. Qwen 3.5 won math, coding, and multilingual. Llama 4 Scout won reasoning and context length. Gemma 3 was the also-ran.
That article is now outdated.
Google just released Gemma 4 — four model sizes, a new MoE architecture, multimodal audio support, thinking mode, and benchmark scores that make Gemma 3's numbers look like a different era. The jump isn't incremental. It's the largest single-generation improvement we've seen in the open model space.
The Gemma 4 Family
Four models, two architectures, spanning edge devices to full GPUs:
| Model | Architecture | Total Params | Active Params | Context | Modalities |
|---|---|---|---|---|---|
| Gemma 4 E2B | Dense | 5.1B | 2.3B | 128K | Text, Image, Audio, Video |
| Gemma 4 E4B | Dense | 8B | 4.5B | 128K | Text, Image, Audio, Video |
| Gemma 4 26B-A4B | MoE (128 experts) | 25.2B | 3.8B | 256K | Text, Image, Video |
| Gemma 4 31B | Dense | 30.7B | 30.7B | 256K | Text, Image, Video |
The naming convention: the E prefix means edge-optimized; A marks active parameters in the MoE variant, so "26B-A4B" = 26B total parameters, ~4B active per token. Note that the E models list fewer active than total parameters despite being dense — this likely reflects the per-layer embedding design described below, which lets part of the embedding weights sit outside accelerator memory.
The standout is the 26B-A4B. It uses 128 small experts with 8 active per token plus one shared always-on expert. This is a different design philosophy from Llama 4 Scout's 16 large experts — Google bet on many small experts rather than fewer large ones.
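To make the routing scheme concrete, here is a minimal sketch of a many-small-experts MoE layer as the article describes it — 128 experts, 8 routed per token, plus one always-on shared expert. All dimensions and the single-matrix "experts" are illustrative simplifications, not Gemma 4's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N_EXPERTS, TOP_K = 64, 128, 8  # hidden size is illustrative, not Gemma 4's

# For brevity each "expert" is a single weight matrix rather than a full MLP.
experts = rng.normal(0, 0.02, size=(N_EXPERTS, D, D))
shared_expert = rng.normal(0, 0.02, size=(D, D))    # always-on expert
router = rng.normal(0, 0.02, size=(D, N_EXPERTS))   # routing projection

def moe_forward(x):
    """Route one token through its top-k experts plus the shared expert."""
    logits = x @ router                       # one score per expert
    top = np.argsort(logits)[-TOP_K:]         # indices of the 8 chosen experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                      # softmax over the selected experts
    routed = sum(g * (x @ experts[i]) for g, i in zip(gates, top))
    return routed + x @ shared_expert         # shared expert is ungated here

token = rng.normal(size=D)
y = moe_forward(token)
print(y.shape)  # (64,)
```

Only 9 of 129 expert matrices touch each token, which is where the compute savings discussed later come from.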
The Numbers: Gemma 3 vs Gemma 4
These comparisons use the same benchmarks, same evaluation conditions. The improvements are not subtle.
Reasoning & Knowledge
| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B-A4B | Change (31B) |
|---|---|---|---|---|
| MMLU Pro | 67.6% | 85.2% | 82.6% | +17.6 pts |
| GPQA Diamond | 42.4% | 84.3% | 82.3% | +41.9 pts |
| BigBench Extra Hard | 19.3% | 74.4% | 64.8% | +55.1 pts |
| MMMLU (multilingual) | 70.7% | 88.4% | 86.3% | +17.7 pts |
GPQA Diamond — graduate-level reasoning — nearly doubled. BigBench Extra Hard went from 19% to 74%. These aren't incremental gains. Gemma 3 was struggling with hard reasoning; Gemma 4 handles it.
Mathematics
| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B-A4B | Change (31B) |
|---|---|---|---|---|
| AIME 2026 | 20.8% | 89.2% | 88.3% | +68.4 pts |
From 20.8% to 89.2% on competition math. This is the single most dramatic benchmark improvement in the table. For context, in our March comparison, Qwen 3.5-27B scored 48.7% on AIME 2025 and was the math leader. Gemma 4 nearly doubles that.
The thinking mode — where the model reasons step-by-step before answering — is likely driving this. When Gemma 4 "thinks," it can produce 4,000+ tokens of reasoning before committing to an answer.
Coding
| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B-A4B | Change (31B) |
|---|---|---|---|---|
| LiveCodeBench v6 | 29.1% | 80.0% | 77.1% | +50.9 pts |
| Codeforces ELO | 110 | 2150 | 1718 | +2040 pts |
Codeforces ELO went from 110 (barely functional) to 2150 (expert competitive programmer). LiveCodeBench nearly tripled. The coding gap between Gemma and the competition didn't just close — it reversed.
Vision
| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B-A4B |
|---|---|---|---|
| MMMU Pro | 49.7% | 76.9% | 73.8% |
| MATH-Vision | 46.0% | 85.6% | 82.4% |
Vision understanding saw similar jumps. MATH-Vision — solving math problems from images — nearly doubled. The model now handles charts, diagrams, and handwritten equations significantly better.
Long Context
| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B-A4B |
|---|---|---|---|
| MRCR v2 (128K avg) | 13.5% | 66.4% | 44.1% |
Gemma 3's 128K context was mostly theoretical — it could accept long inputs but couldn't reliably use information from them. Gemma 4 at 256K context actually retrieves and reasons over long documents. The 31B model went from 13.5% to 66.4% on multi-needle retrieval tests.
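A multi-needle retrieval check of the kind MRCR formalizes can be reproduced in miniature: scatter several key facts through filler text, ask for them all back, and score recall. This is a toy harness under our own assumptions, not the MRCR benchmark itself; the model call is left as a stand-in.

```python
import random

def build_haystack(needles, filler_paragraphs=200, seed=0):
    """Scatter key/value 'needles' at random positions through filler text."""
    rng = random.Random(seed)
    filler = [f"Paragraph {i}: lorem ipsum placeholder text."
              for i in range(filler_paragraphs)]
    for key, value in needles.items():
        filler.insert(rng.randrange(len(filler)), f"The {key} is {value}.")
    return "\n".join(filler)

def score(needles, answer_text):
    """Fraction of needle values the model's answer reproduced."""
    hits = sum(1 for v in needles.values() if v in answer_text)
    return hits / len(needles)

needles = {"launch code": "zebra-42", "meeting room": "B-217",
           "wifi password": "corgi9"}
context = build_haystack(needles)
# answer = generate(context + "\nList the launch code, meeting room, "
#                   "and wifi password.")   # stand-in for a real model call
fake_answer = "launch code zebra-42, room B-217"  # simulated partial recall
print(score(needles, fake_answer))  # ~0.67
```

Scores like Gemma 3's 13.5% correspond to a model that almost never recovers all the planted facts; 66.4% means most runs succeed.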
The MoE Efficiency Story
The 26B-A4B deserves special attention. Look at these numbers again:
| Benchmark | Gemma 4 31B (30.7B active) | Gemma 4 26B-A4B (3.8B active) |
|---|---|---|
| MMLU Pro | 85.2% | 82.6% |
| AIME 2026 | 89.2% | 88.3% |
| LiveCodeBench v6 | 80.0% | 77.1% |
| GPQA Diamond | 84.3% | 82.3% |
| LMArena Score | ~1452 | ~1441 |
The MoE variant achieves 97% of the dense model's quality while activating only 3.8B parameters per token instead of 30.7B. That's 8x less compute per inference step.
For deployment, this means:
- Much less VRAM needed for KV cache at long contexts
- Faster inference — fewer parameters to compute per token
- Lower cost per query in production
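The "8x less compute" claim can be checked with the common approximation that a decoder forward pass costs roughly 2 × (active parameters) FLOPs per generated token; attention overhead is ignored here, so treat this as a back-of-envelope figure.

```python
# Rough per-token compute, using the ~2 * active_params FLOPs/token
# approximation for a decoder forward pass (attention cost ignored).
dense_active = 30.7e9   # Gemma 4 31B: all parameters active
moe_active   = 3.8e9    # Gemma 4 26B-A4B: 8 routed + 1 shared expert active

dense_flops = 2 * dense_active
moe_flops   = 2 * moe_active

print(f"dense: {dense_flops/1e9:.1f} GFLOPs/token")
print(f"moe  : {moe_flops/1e9:.1f} GFLOPs/token")
print(f"ratio: {dense_flops/moe_flops:.1f}x")  # ~8.1x
```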
Google's choice of 128 small experts (vs Llama 4's 16 large experts) appears to work. The LMArena score of 1441 with only 4B active params is remarkable — it's competitive with models 8x its active size.
How Gemma 4 Reshapes Our Comparison
Our March rankings put Qwen 3.5 first, Llama 4 second, Gemma 3 third. Here's how Gemma 4 changes each category:
| Category | March Winner | Updated Assessment |
|---|---|---|
| General reasoning | Llama 4 Scout | Gemma 4 31B takes the lead (84.3% GPQA vs Scout's 74.3%) |
| Mathematics | Qwen 3.5-27B | Gemma 4 dominates (89.2% AIME, well ahead of Qwen's ~49%) |
| Coding | Qwen 3.5-27B | Gemma 4 dominates (80.0% LiveCodeBench vs Qwen's ~43%) |
| Multilingual | Qwen 3.5-27B | Likely still Qwen (250K vocab, 201 languages vs Gemma's 140) |
| Inference speed | Gemma 3 27B | TBD — need to benchmark Gemma 4 31B on same hardware |
| Context length | Llama 4 Scout (10M) | Still Llama 4 (10M vs 256K), but Gemma 4 actually uses its context |
| License | Qwen 3.5 (Apache 2.0) | Tie — Gemma 4 is now Apache 2.0 too |
| VRAM efficiency | Qwen 3.5-9B | Gemma 4 26B-A4B is the new efficiency king |
Note on benchmark versions: Our March tests used AIME 2025, LiveCodeBench v5, and standard MMLU. Gemma 4's reported scores use AIME 2026, LiveCodeBench v6, and MMLU Pro. Direct numerical comparison across versions should be taken as directional, not exact. The Gemma 3 → Gemma 4 comparisons above use identical benchmark versions.
The Apache 2.0 Switch
Gemma 3 shipped with the "Gemma Open" license — commercial use allowed but with Google-specific terms and restrictions. In our March comparison, we flagged this as a disadvantage against Qwen 3.5's Apache 2.0.
Gemma 4 switches to Apache 2.0. No usage restrictions, no MAU limits, no acceptable use policies. The same license as Qwen 3.5.
This removes one of the last arguments against Gemma. For businesses building products on open models, the licensing playing field is now level between Gemma 4 and Qwen 3.5. Llama 4's community license (700M MAU limit + Meta's acceptable use policy) is now the most restrictive of the three families.
What's New Beyond Benchmarks
Thinking Mode
Gemma 4 supports extended reasoning — the model produces a chain-of-thought before answering, similar to DeepSeek-R1 or OpenAI o1. This is what drives the massive math and reasoning improvements. The thinking can run to 4,000+ tokens, giving the model space to break problems down, try approaches, and verify its work.
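In practice, applications usually strip the reasoning trace before showing the user the answer. The sketch below assumes the reasoning is wrapped in `<think>...</think>` delimiters, as several thinking models use; the exact tag format Gemma 4 emits is not specified here and may differ.

```python
import re

# Assumed delimiter: many thinking models wrap reasoning in <think>...</think>;
# the tag Gemma 4 actually uses may differ.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(raw: str):
    """Separate the chain-of-thought from the final answer."""
    thoughts = THINK_RE.findall(raw)
    answer = THINK_RE.sub("", raw).strip()
    return "\n".join(thoughts).strip(), answer

raw = ("<think>AIME problem: try modular arithmetic... "
       "7 mod 12 works.</think>The answer is 7.")
thinking, answer = split_thinking(raw)
print(answer)  # The answer is 7.
```

Note that with 4,000+ tokens of reasoning, the stripped trace can dwarf the visible answer — relevant to the billing and latency questions raised later.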
Multimodal Audio
The smaller models (E2B, E4B) support audio input — speech transcription and audio Q&A. The larger models (26B-A4B, 31B) handle image and video but not audio. This is an unusual split: the edge models are more multimodal than the flagship.
Native Function Calling
All models support structured function calling out of the box — returning JSON with tool calls without special prompting. Combined with the thinking mode, this makes Gemma 4 a strong candidate for agentic workflows where the model needs to reason about which tools to call and in what order.
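The consumer side of structured function calling looks roughly like this. The article doesn't specify Gemma 4's exact output schema, so this sketch assumes an OpenAI-style `{"name": ..., "arguments": ...}` JSON object; `get_weather` is a hypothetical tool, not part of any Gemma API.

```python
import json

# Hypothetical tool; purely illustrative.
def get_weather(city: str) -> str:
    return f"18C and cloudy in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call (OpenAI-style shape assumed) and execute it."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]        # look up the requested tool
    return fn(**call["arguments"])  # invoke with the model-supplied arguments

# Stand-in for what a function-calling model might emit:
model_output = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'
print(dispatch(model_output))  # 18C and cloudy in Berlin
```

"Out of the box" matters here: without native support, getting valid JSON reliably requires few-shot prompting or constrained decoding.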
Per-Layer Embeddings (PLE)
A novel architecture feature: a second embedding table feeds residual signals into every decoder layer, giving each layer a token-identity component tailored to that specific layer's role. This is a quiet innovation that likely contributes to the quality improvements across the board.
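A minimal sketch of the idea, under our own simplifying assumptions (tiny illustrative sizes, a tanh stand-in for the real attention + MLP block): alongside the usual input embedding, a second table contributes a layer-specific embedding of the same token into each layer's residual stream.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D, N_LAYERS = 1000, 64, 4  # illustrative sizes, not Gemma 4's

tok_embed = rng.normal(0, 0.02, size=(VOCAB, D))            # standard table
ple_embed = rng.normal(0, 0.02, size=(N_LAYERS, VOCAB, D))  # one slice per layer
layer_weights = rng.normal(0, 0.02, size=(N_LAYERS, D, D))

def decoder_layer(h, w):
    """Stand-in for a real attention + MLP block."""
    return h + np.tanh(h @ w)

def forward(token_ids):
    h = tok_embed[token_ids]              # (seq, D)
    for l in range(N_LAYERS):
        h = h + ple_embed[l][token_ids]   # per-layer token-identity signal
        h = decoder_layer(h, layer_weights[l])
    return h

out = forward(np.array([1, 2, 3]))
print(out.shape)  # (3, 64)
```

Because the extra table is indexed by token id, these lookups can plausibly be kept off the accelerator and fetched per token — one candidate explanation for the E models' total-vs-active parameter gap noted earlier.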
Shared KV Cache
The last several decoder layers share key-value tensors, reducing memory usage during long-context inference with minimal quality impact. Combined with the 256K context window, this makes Gemma 4 practical for long-document workflows where Gemma 3 was only theoretical.
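The memory effect is easy to estimate. The configuration below is an illustrative guess (layer count, KV heads, and how many tail layers share a cache are not published in this article), but the arithmetic shows the shape of the saving at a 256K context.

```python
# Rough KV-cache memory at 256K context, with and without the last few
# layers sharing one KV tensor. All sizes are illustrative guesses, not
# Gemma 4's published configuration.
n_layers, shared_tail = 48, 8     # assume the last 8 layers share one cache
n_kv_heads, head_dim = 8, 128
seq_len, bytes_per = 256_000, 2   # bf16

def kv_bytes(unique_layers):
    return 2 * unique_layers * n_kv_heads * head_dim * seq_len * bytes_per  # 2x for K and V

baseline = kv_bytes(n_layers)
shared   = kv_bytes(n_layers - shared_tail + 1)  # tail collapses to one cache
print(f"baseline: {baseline / 2**30:.1f} GiB")
print(f"shared  : {shared / 2**30:.1f} GiB")
```

Even a modest tail of shared layers shaves gigabytes off the cache at full context, which is exactly where long-document serving hurts.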
Updated Decision Matrix
| If you need... | Use | Why |
|---|---|---|
| Best overall quality (32 GB GPU) | Gemma 4 31B | Leads reasoning, math, coding, vision |
| Best quality per compute | Gemma 4 26B-A4B | 97% of 31B quality at 8x less compute |
| Maximum context window | Llama 4 Scout | Still 10M tokens, unmatched |
| Best multilingual | Qwen 3.5-27B | 250K vocab, 201 languages |
| Best under 10 GB VRAM | Gemma 4 E4B or Qwen 3.5-9B | Both strong; benchmark head-to-head needed |
| Edge / mobile deployment | Gemma 4 E2B | 2.3B active, audio support, 128K context |
| Most permissive license | Gemma 4 or Qwen 3.5 | Both Apache 2.0 |
| Audio understanding | Gemma 4 E4B | Only open model family with native audio |
| Agentic workflows | Gemma 4 31B | Thinking mode + native function calling |
What We Still Need to Test
We haven't run Gemma 4 on our RTX 5090 benchmark suite yet. Key unknowns:
- Actual inference speed — the 31B dense model should be comparable to Gemma 3 27B in tok/s, but the MoE 26B-A4B is the interesting question: with 128 experts and only 3.8B active params, it could be very fast.
- VRAM usage with quantization — Q6_K and Q4_K_M sizes for each variant
- Real-world multilingual performance — Gemma claims 140 languages, but Qwen's 201-language, 250K-vocabulary advantage may still hold for CJK and non-Latin scripts
- Thinking mode overhead — how much slower is inference when the model reasons for 4,000 tokens before answering?
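The thinking-overhead question in particular reduces to a simple A/B timing harness. This is a sketch only: `fake_generate` simulates a model call, and a real test would wrap llama.cpp, vLLM, or transformers with thinking toggled on and off.

```python
import time

def time_generation(generate, prompt, **kwargs):
    """Wall-clock one generation call; returns (seconds, output)."""
    t0 = time.perf_counter()
    out = generate(prompt, **kwargs)
    return time.perf_counter() - t0, out

def fake_generate(prompt, thinking=False):
    # Stand-in for a real model call; simulate extra reasoning tokens.
    time.sleep(0.03 if thinking else 0.01)
    return "42"

t_plain, _ = time_generation(fake_generate, "AIME problem...", thinking=False)
t_think, _ = time_generation(fake_generate, "AIME problem...", thinking=True)
print(f"thinking overhead: {t_think / t_plain:.1f}x")
```

For a fair comparison, the harness should also count generated tokens, since thinking runs both longer and (per output token) at the same per-step cost.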
We'll publish a full hands-on benchmark when we've run the tests. For now, Google's reported numbers are strong enough to change the recommendation.
The Bottom Line
A month ago we wrote: "Choose Gemma 3 when you need Google ecosystem integration or when marginal inference speed differences matter. It's a solid model, but it doesn't lead in any benchmark category."
That's no longer true. Gemma 4 leads in reasoning, math, coding, and vision. The 26B-A4B MoE variant offers the best quality-per-compute ratio in the open model space. The license is now Apache 2.0. The context window works.
The open model race just got a new leader. Qwen 3.5 still holds the multilingual crown, and Llama 4 Scout still has the unmatched 10M context window. But for overall quality, especially on hard reasoning and coding tasks, Gemma 4 is the model to beat.
The ball is now in Alibaba's and Meta's court.
This article is a follow-up to Llama 4 vs Qwen 3.5 vs Gemma 3: Which Open Model Should You Deploy?, published March 6, 2026.