
Gemma 4 vs Qwen 3.5 vs Llama 4: Updated Benchmarks, New Leader

ai.rs Apr 2, 2026

One Month Later, Everything Changed

In early March, we published a head-to-head comparison of Llama 4, Qwen 3.5, and Gemma 3. The conclusion was clear: Gemma 3 finished last in every category except raw inference speed. Qwen 3.5 won the math, coding, and multilingual categories; Llama 4 Scout won reasoning and context length. Gemma 3 was the also-ran.

That article is now outdated.

Google just released Gemma 4 — four model sizes, a new MoE architecture, multimodal audio support, thinking mode, and benchmark scores that make Gemma 3's numbers look like a different era. The jump isn't incremental. It's the largest single-generation improvement we've seen in the open model space.

The Gemma 4 Family

Four models, two architectures, spanning edge devices to full GPUs:

| Model | Architecture | Total Params | Active Params | Context | Modalities |
|---|---|---|---|---|---|
| Gemma 4 E2B | Dense | 5.1B | 2.3B | 128K | Text, Image, Audio, Video |
| Gemma 4 E4B | Dense | 8B | 4.5B | 128K | Text, Image, Audio, Video |
| Gemma 4 26B-A4B | MoE (128 experts) | 25.2B | 3.8B | 256K | Text, Image, Video |
| Gemma 4 31B | Dense | 30.7B | 30.7B | 256K | Text, Image, Video |

The naming convention: the E prefix means edge-optimized, and A means active parameters in the MoE variant. So "26B-A4B" = roughly 26B total parameters, 4B active per token (the table figures are exact: 25.2B total, 3.8B active).

The standout is the 26B-A4B. It uses 128 small experts with 8 active per token plus one shared always-on expert. This is a different design philosophy from Llama 4 Scout's 16 large experts — Google bet on many small experts rather than fewer large ones.
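The many-small-experts design can be sketched in a few lines. This is an illustrative toy, not Gemma 4's actual routing code (which Google has not published); only the shape of the idea is taken from the article: 128 experts, 8 routed per token, plus one always-on shared expert. All function names here are hypothetical.

```python
import math
import random

def moe_layer(x, router_logits, experts, shared_expert, k=8):
    """Toy top-k MoE routing: pick the k highest-scoring experts for this
    token, softmax-weight their outputs, and add an always-active shared
    expert. Real Gemma 4 routing details are not public."""
    top = sorted(range(len(experts)), key=lambda i: router_logits[i])[-k:]
    exps = [math.exp(router_logits[i]) for i in top]
    z = sum(exps)
    weights = [e / z for e in exps]          # softmax over the selected k only
    routed = sum(w * experts[i](x) for w, i in zip(weights, top))
    return routed + shared_expert(x)         # shared expert sees every token

# Demo with scalar "experts": 128 experts, 8 active, as in the 26B-A4B config
random.seed(0)
experts = [lambda x, g=random.gauss(0, 1): g * x for _ in range(128)]
shared = lambda x: 0.5 * x
logits = [random.gauss(0, 1) for _ in range(128)]
y = moe_layer(1.0, logits, experts, shared, k=8)
print(f"{y:.3f}")
```

The key property is that only 8 of the 128 expert functions run per token, which is where the low active-parameter count comes from.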

The Numbers: Gemma 3 vs Gemma 4

These comparisons use the same benchmarks, same evaluation conditions. The improvements are not subtle.

Reasoning & Knowledge

| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B-A4B | Change (31B) |
|---|---|---|---|---|
| MMLU Pro | 67.6% | 85.2% | 82.6% | +17.6 pts |
| GPQA Diamond | 42.4% | 84.3% | 82.3% | +41.9 pts |
| BigBench Extra Hard | 19.3% | 74.4% | 64.8% | +55.1 pts |
| MMMLU (multilingual) | 70.7% | 88.4% | 86.3% | +17.7 pts |

GPQA Diamond — graduate-level reasoning — nearly doubled. BigBench Extra Hard went from 19% to 74%. These aren't incremental gains. Gemma 3 was struggling with hard reasoning; Gemma 4 handles it.

Mathematics

| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B-A4B | Change (31B) |
|---|---|---|---|---|
| AIME 2026 | 20.8% | 89.2% | 88.3% | +68.4 pts |

From 20.8% to 89.2% on competition math. This is the single most dramatic benchmark improvement in the table. For context, in our March comparison, Qwen 3.5-27B scored 48.7% on AIME 2025 and was the math leader. Gemma 4 nearly doubles that.

The thinking mode — where the model reasons step-by-step before answering — is likely driving this. When Gemma 4 "thinks," it can produce 4,000+ tokens of reasoning before committing to an answer.

Coding

| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B-A4B | Change (31B) |
|---|---|---|---|---|
| LiveCodeBench v6 | 29.1% | 80.0% | 77.1% | +50.9 pts |
| Codeforces Elo | 110 | 2150 | 1718 | +2040 pts |

Codeforces Elo went from 110 (barely functional) to 2150 (expert competitive programmer). LiveCodeBench nearly tripled. The coding gap between Gemma and the competition didn't just close — it reversed.

Vision

| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B-A4B |
|---|---|---|---|
| MMMU Pro | 49.7% | 76.9% | 73.8% |
| MATH-Vision | 46.0% | 85.6% | 82.4% |

Vision understanding saw similar jumps. MATH-Vision — solving math problems from images — nearly doubled. The model now handles charts, diagrams, and handwritten equations significantly better.

Long Context

| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B-A4B |
|---|---|---|---|
| MRCR v2 (128K avg) | 13.5% | 66.4% | 44.1% |

Gemma 3's 128K context was mostly theoretical — it could accept long inputs but couldn't reliably use information from them. Gemma 4 at 256K context actually retrieves and reasons over long documents. The 31B model went from 13.5% to 66.4% on multi-needle retrieval tests.

The MoE Efficiency Story

The 26B-A4B deserves special attention. Look at these numbers again:

| Benchmark | Gemma 4 31B (30.7B active) | Gemma 4 26B-A4B (3.8B active) |
|---|---|---|
| MMLU Pro | 85.2% | 82.6% |
| AIME 2026 | 89.2% | 88.3% |
| LiveCodeBench v6 | 80.0% | 77.1% |
| GPQA Diamond | 84.3% | 82.3% |
| LMArena Score | ~1452 | ~1441 |

The MoE variant achieves 97% of the dense model's quality while activating only 3.8B parameters per token instead of 30.7B. That's 8x less compute per inference step.

For deployment, this means:

  • Much less VRAM needed for KV cache at long contexts
  • Faster inference — fewer parameters to compute per token
  • Lower cost per query in production

Google's choice of 128 small experts (vs Llama 4's 16 large experts) appears to work. The LMArena score of 1441 with only 4B active params is remarkable — it's competitive with models 8x its active size.
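The "8x less compute" claim is simple arithmetic on the active-parameter counts from the table, using the standard rough estimate of ~2 FLOPs per active parameter per generated token:

```python
# Back-of-envelope: decode-time FLOPs scale with *active* parameters,
# at roughly 2 FLOPs per active parameter per generated token.
dense_active = 30.7e9   # Gemma 4 31B: every parameter is active
moe_active   = 3.8e9    # Gemma 4 26B-A4B: 8 of 128 experts + shared expert

flops_dense = 2 * dense_active
flops_moe   = 2 * moe_active
print(f"dense: {flops_dense / 1e9:.1f} GFLOPs/token")
print(f"moe:   {flops_moe / 1e9:.1f} GFLOPs/token")
print(f"ratio: {flops_dense / flops_moe:.1f}x")  # ~8.1x
```

This ignores attention FLOPs and memory-bandwidth effects, so real-world speedups will differ, but it explains where the headline ratio comes from.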

How Gemma 4 Reshapes Our Comparison

Our March rankings put Qwen 3.5 first, Llama 4 second, Gemma 3 third. Here's how Gemma 4 changes each category:

| Category | March Winner | Updated Assessment |
|---|---|---|
| General reasoning | Llama 4 Scout | Gemma 4 31B takes the lead (84.3% GPQA vs Scout's 74.3%) |
| Mathematics | Qwen 3.5-27B | Gemma 4 dominates (89.2% AIME, well ahead of Qwen's ~49%) |
| Coding | Qwen 3.5-27B | Gemma 4 dominates (80.0% LiveCodeBench vs Qwen's ~43%) |
| Multilingual | Qwen 3.5-27B | Likely still Qwen (250K vocab, 201 languages vs Gemma's 140) |
| Inference speed | Gemma 3 27B | TBD — need to benchmark Gemma 4 31B on the same hardware |
| Context length | Llama 4 Scout (10M) | Still Llama 4 (10M vs 256K), but Gemma 4 actually uses its context |
| License | Qwen 3.5 (Apache 2.0) | Tie — Gemma 4 is now Apache 2.0 too |
| VRAM efficiency | Qwen 3.5-9B | Gemma 4 26B-A4B is the new efficiency king |

Note on benchmark versions: Our March tests used AIME 2025, LiveCodeBench v5, and standard MMLU. Gemma 4's reported scores use AIME 2026, LiveCodeBench v6, and MMLU Pro. Direct numerical comparison across versions should be taken as directional, not exact. The Gemma 3 → Gemma 4 comparisons above use identical benchmark versions.

The Apache 2.0 Switch

Gemma 3 shipped with the "Gemma Open" license — commercial use allowed but with Google-specific terms and restrictions. In our March comparison, we flagged this as a disadvantage against Qwen 3.5's Apache 2.0.

Gemma 4 switches to Apache 2.0. No usage restrictions, no MAU limits, no acceptable use policies. The same license as Qwen 3.5.

This removes one of the last arguments against Gemma. For businesses building products on open models, the licensing playing field is now level between Gemma 4 and Qwen 3.5. Llama 4's community license (700M MAU limit + Meta's acceptable use policy) is now the most restrictive of the three families.

What's New Beyond Benchmarks

Thinking Mode

Gemma 4 supports extended reasoning — the model produces a chain-of-thought before answering, similar to DeepSeek-R1 or OpenAI o1. This is what drives the massive math and reasoning improvements. The thinking can run to 4,000+ tokens, giving the model space to break problems down, try approaches, and verify its work.
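In practice, applications usually need to separate the chain-of-thought from the final answer before showing anything to users. A minimal sketch, assuming R1-style `<think>...</think>` delimiters — an assumption on our part, since the article doesn't specify Gemma 4's exact output format (check the model card before relying on this):

```python
import re

def split_thinking(raw: str):
    """Split a thinking-mode completion into (reasoning, answer).
    Assumes <think>...</think> delimiters, as used by DeepSeek-R1;
    Gemma 4's actual delimiters may differ."""
    m = re.search(r"<think>(.*?)</think>\s*(.*)", raw, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", raw.strip()  # no thinking block found

raw = "<think>Try casework on the parity...</think> The answer is 204."
thinking, answer = split_thinking(raw)
print(answer)  # The answer is 204.
```

Whatever the delimiter, the pattern is the same: log or discard the reasoning span, return only the answer span, and budget for those 4,000+ extra tokens in latency and cost.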

Multimodal Audio

The smaller models (E2B, E4B) support audio input — speech transcription and audio Q&A. The larger models (26B-A4B, 31B) handle image and video but not audio. This is an unusual split: the edge models are more multimodal than the flagship.

Native Function Calling

All models support structured function calling out of the box — returning JSON with tool calls without special prompting. Combined with the thinking mode, this makes Gemma 4 a strong candidate for agentic workflows where the model needs to reason about which tools to call and in what order.
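Consuming a structured tool call looks roughly like this. The payload shape below (a JSON object with `name` and `arguments`) is the common convention across open models, but it's an assumption here — Gemma 4's exact schema is defined by its chat template, and `get_weather` is a hypothetical tool:

```python
import json

# Hypothetical tool-call payload; the real Gemma 4 schema comes from its
# chat template, but name/arguments objects are the typical shape.
model_output = '{"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}'

# Registry mapping tool names to local implementations (toy example)
TOOLS = {"get_weather": lambda city, unit: f"14 deg {unit} in {city}"}

call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])
print(result)  # 14 deg celsius in Berlin
```

In an agent loop, `result` would be fed back to the model as a tool message, letting the thinking mode plan the next call.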

Per-Layer Embeddings (PLE)

A novel architecture feature: a second embedding table feeds residual signals into every decoder layer, giving each layer a token-identity component tailored to that specific layer's role. This is a quiet innovation that likely contributes to the quality improvements across the board.
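Structurally, the idea is easy to sketch: alongside the usual input embedding table, each decoder layer gets its own token-indexed table whose rows are added to that layer's residual stream. The sketch below is our illustration of that description, with toy dimensions and a stand-in for the attention/MLP block — it is not Gemma 4's actual implementation:

```python
import random

random.seed(0)
VOCAB, D, LAYERS = 1000, 8, 4

# Standard input embedding table, read once at the bottom of the stack
tok_emb = [[random.gauss(0, 0.02) for _ in range(D)] for _ in range(VOCAB)]
# Per-Layer Embeddings: a second table giving each token a separate
# learned vector for *every* decoder layer
ple = [[[random.gauss(0, 0.02) for _ in range(D)] for _ in range(VOCAB)]
       for _ in range(LAYERS)]

def decoder_layer(h, layer_idx, token_id):
    # stand-in for attention + MLP; the point is the PLE residual add
    per_layer = ple[layer_idx][token_id]   # token identity, layer-specific
    return [a + b for a, b in zip(h, per_layer)]

def forward(token_id):
    h = tok_emb[token_id]
    for i in range(LAYERS):
        h = decoder_layer(h, i, token_id)  # every layer re-sees the token
    return h

print(len(forward(42)))  # 8
```

The design point: instead of forcing one embedding to serve all depths, each layer gets a token signal tuned to its own role in the stack.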

Shared KV Cache

The last several decoder layers share key-value tensors, reducing memory usage during long-context inference with minimal quality impact. Combined with the 256K context window, this makes Gemma 4 practical for long-document workflows where Gemma 3 was only theoretical.
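The memory win is easy to quantify. The calculation below uses illustrative placeholder dimensions (layer count, KV heads, head size are our guesses, not published Gemma 4 figures), but the shape of the savings holds for any config:

```python
# KV-cache memory at long context, with and without the last layers
# sharing one key/value set. All dimensions are illustrative placeholders,
# not published Gemma 4 numbers.
def kv_bytes(n_kv_layers, n_kv_heads=8, head_dim=128, ctx=256_000, dtype_bytes=2):
    return 2 * n_kv_layers * n_kv_heads * head_dim * ctx * dtype_bytes  # 2x for K and V

total_layers, shared_tail = 48, 8            # last 8 layers reuse a single KV set
baseline = kv_bytes(total_layers)            # every layer stores its own KV
shared   = kv_bytes(total_layers - shared_tail + 1)
print(f"baseline: {baseline / 2**30:.1f} GiB")
print(f"shared:   {shared / 2**30:.1f} GiB")
```

With these toy numbers, sharing the tail layers trims the 256K-context KV cache by about 15%, and the saving compounds with the MoE variant's already-low active compute.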

Updated Decision Matrix

| If you need... | Use | Why |
|---|---|---|
| Best overall quality (32 GB GPU) | Gemma 4 31B | Leads reasoning, math, coding, vision |
| Best quality per compute | Gemma 4 26B-A4B | 97% of 31B quality at 8x less compute |
| Maximum context window | Llama 4 Scout | Still 10M tokens, unmatched |
| Best multilingual | Qwen 3.5-27B | 250K vocab, 201 languages |
| Best under 10 GB VRAM | Gemma 4 E4B or Qwen 3.5-9B | Both strong; benchmark head-to-head needed |
| Edge / mobile deployment | Gemma 4 E2B | 2.3B active, audio support, 128K context |
| Most permissive license | Gemma 4 or Qwen 3.5 | Both Apache 2.0 |
| Audio understanding | Gemma 4 E4B | Only open model family with native audio |
| Agentic workflows | Gemma 4 31B | Thinking mode + native function calling |

What We Still Need to Test

We haven't run Gemma 4 on our RTX 5090 benchmark suite yet. Key unknowns:

  • Actual inference speed — the 31B dense model should be comparable to Gemma 3 27B in tok/s, but the MoE 26B-A4B is the interesting question. With 128 experts and 3.8B active params, it could be very fast
  • VRAM usage with quantization — Q6_K and Q4_K_M sizes for each variant
  • Real-world multilingual performance — Gemma claims 140 languages, but Qwen's 201-language, 250K-vocabulary advantage may still hold for CJK and non-Latin scripts
  • Thinking mode overhead — how much slower is inference when the model reasons for 4,000 tokens before answering?
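For the VRAM question, file sizes can at least be estimated ahead of time from parameter counts. The effective bits-per-weight figures below are approximate values for llama.cpp's mixed-precision quant types and should be treated as ballpark, not exact:

```python
# Rough GGUF file-size estimate: total params x effective bits-per-weight / 8.
# Bits/weight for llama.cpp quants are approximate (mixed-precision layers).
BITS = {"Q4_K_M": 4.85, "Q6_K": 6.56}

def gguf_gib(total_params, quant):
    return total_params * BITS[quant] / 8 / 2**30

for name, params in [("Gemma 4 26B-A4B", 25.2e9), ("Gemma 4 31B", 30.7e9)]:
    for q in ("Q4_K_M", "Q6_K"):
        print(f"{name} {q}: ~{gguf_gib(params, q):.1f} GiB")
```

Note this is file size only; runtime VRAM adds the KV cache, which at 256K context can rival the weights themselves.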

We'll publish a full hands-on benchmark when we've run the tests. For now, Google's reported numbers are strong enough to change the recommendation.

The Bottom Line

A month ago we wrote: "Choose Gemma 3 when you need Google ecosystem integration or when marginal inference speed differences matter. It's a solid model, but it doesn't lead in any benchmark category."

That's no longer true. Gemma 4 leads in reasoning, math, coding, and vision. The 26B-A4B MoE variant offers the best quality-per-compute ratio in the open model space. The license is now Apache 2.0. The context window works.

The open model race just got a new leader. Qwen 3.5 still holds the multilingual crown, and Llama 4 Scout still has the unmatched 10M context window. But for overall quality, especially on hard reasoning and coding tasks, Gemma 4 is the model to beat.

The ball is now in Alibaba's and Meta's court.


This article is a follow-up to Llama 4 vs Qwen 3.5 vs Gemma 3: Which Open Model Should You Deploy?, published March 6, 2026.
