
Gemma 4 vs Qwen 3.5 vs Llama 4: Updated Benchmarks, New Leader

ai.rs Apr 2, 2026

One Month Later, Everything Changed

In early March, we published a head-to-head comparison of Llama 4, Qwen 3.5, and Gemma 3. The conclusion was clear: Gemma 3 finished last in every category except raw inference speed. Qwen 3.5 won the math, coding, and multilingual categories; Llama 4 Scout won reasoning and context length. Gemma 3 was the also-ran.

That article is now outdated.

Google just released Gemma 4 — four model sizes, a new MoE architecture, multimodal audio support, thinking mode, and benchmark scores that make Gemma 3's numbers look like a different era. The jump isn't incremental. It's the largest single-generation improvement we've seen in the open model space.

The Gemma 4 Family

Four models, two architectures, spanning edge devices to full GPUs:

| Model | Architecture | Total Params | Active Params | Context | Modalities |
|---|---|---|---|---|---|
| Gemma 4 E2B | Dense | 5.1B | 2.3B | 128K | Text, Image, Audio, Video |
| Gemma 4 E4B | Dense | 8B | 4.5B | 128K | Text, Image, Audio, Video |
| Gemma 4 26B-A4B | MoE (128 experts) | 25.2B | 3.8B | 256K | Text, Image, Video |
| Gemma 4 31B | Dense | 30.7B | 30.7B | 256K | Text, Image, Video |

The naming convention: the E prefix means edge-optimized, and A means active parameters in the MoE variant. So "26B-A4B" = roughly 26B total parameters, 4B active per token (the table figures are exact: 25.2B total, 3.8B active).

The standout is the 26B-A4B. It uses 128 small experts with 8 active per token plus one shared always-on expert. This is a different design philosophy from Llama 4 Scout's 16 large experts — Google bet on many small experts rather than fewer large ones.
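The many-small-experts design can be sketched in a few lines. This is an illustrative toy, not Gemma 4's actual routing code (which Google has not published); only the shape of the idea is taken from the article: 128 experts, 8 routed per token, plus one always-on shared expert. All function names here are hypothetical.

```python
import math
import random

def moe_layer(x, router_logits, experts, shared_expert, k=8):
    """Toy top-k MoE routing: pick the k highest-scoring experts for this
    token, softmax-weight their outputs, and add an always-active shared
    expert. Real Gemma 4 routing details are not public."""
    top = sorted(range(len(experts)), key=lambda i: router_logits[i])[-k:]
    exps = [math.exp(router_logits[i]) for i in top]
    z = sum(exps)
    weights = [e / z for e in exps]          # softmax over the selected k only
    routed = sum(w * experts[i](x) for w, i in zip(weights, top))
    return routed + shared_expert(x)         # shared expert sees every token

# Demo with scalar "experts": 128 experts, 8 active, as in the 26B-A4B config
random.seed(0)
experts = [lambda x, g=random.gauss(0, 1): g * x for _ in range(128)]
shared = lambda x: 0.5 * x
logits = [random.gauss(0, 1) for _ in range(128)]
y = moe_layer(1.0, logits, experts, shared, k=8)
print(f"{y:.3f}")
```

The key property is that only 8 of the 128 expert functions run per token, which is where the low active-parameter count comes from.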

The Numbers: Gemma 3 vs Gemma 4

These comparisons use the same benchmarks, same evaluation conditions. The improvements are not subtle.

Reasoning & Knowledge

| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B-A4B | Change (31B) |
|---|---|---|---|---|
| MMLU Pro | 67.6% | 85.2% | 82.6% | +17.6 pts |
| GPQA Diamond | 42.4% | 84.3% | 82.3% | +41.9 pts |
| BigBench Extra Hard | 19.3% | 74.4% | 64.8% | +55.1 pts |
| MMMLU (multilingual) | 70.7% | 88.4% | 86.3% | +17.7 pts |

GPQA Diamond — graduate-level reasoning — nearly doubled. BigBench Extra Hard went from 19% to 74%. These aren't incremental gains. Gemma 3 was struggling with hard reasoning; Gemma 4 handles it.

Mathematics

| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B-A4B | Change (31B) |
|---|---|---|---|---|
| AIME 2026 | 20.8% | 89.2% | 88.3% | +68.4 pts |

From 20.8% to 89.2% on competition math. This is the single most dramatic benchmark improvement in the table. For context, in our March comparison, Qwen 3.5-27B scored 48.7% on AIME 2025 and was the math leader. Gemma 4 nearly doubles that.

The thinking mode — where the model reasons step-by-step before answering — is likely driving this. When Gemma 4 "thinks," it can produce 4,000+ tokens of reasoning before committing to an answer.

Coding

| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B-A4B | Change (31B) |
|---|---|---|---|---|
| LiveCodeBench v6 | 29.1% | 80.0% | 77.1% | +50.9 pts |
| Codeforces Elo | 110 | 2150 | 1718 | +2040 pts |

Codeforces Elo went from 110 (barely functional) to 2150 (expert competitive programmer). LiveCodeBench nearly tripled. The coding gap between Gemma and the competition didn't just close — it reversed.

Vision

| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B-A4B |
|---|---|---|---|
| MMMU Pro | 49.7% | 76.9% | 73.8% |
| MATH-Vision | 46.0% | 85.6% | 82.4% |

Vision understanding saw similar jumps. MATH-Vision — solving math problems from images — nearly doubled. The model now handles charts, diagrams, and handwritten equations significantly better.

Long Context

| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B-A4B |
|---|---|---|---|
| MRCR v2 (128K avg) | 13.5% | 66.4% | 44.1% |

Gemma 3's 128K context was mostly theoretical — it could accept long inputs but couldn't reliably use information from them. Gemma 4 at 256K context actually retrieves and reasons over long documents. The 31B model went from 13.5% to 66.4% on multi-needle retrieval tests.

The MoE Efficiency Story

The 26B-A4B deserves special attention. Look at these numbers again:

| Benchmark | Gemma 4 31B (30.7B active) | Gemma 4 26B-A4B (3.8B active) |
|---|---|---|
| MMLU Pro | 85.2% | 82.6% |
| AIME 2026 | 89.2% | 88.3% |
| LiveCodeBench v6 | 80.0% | 77.1% |
| GPQA Diamond | 84.3% | 82.3% |
| LMArena Score | ~1452 | ~1441 |

The MoE variant achieves 97% of the dense model's quality while activating only 3.8B parameters per token instead of 30.7B. That's 8x less compute per inference step.

For deployment, this means:

  • Much less VRAM needed for KV cache at long contexts
  • Faster inference — fewer parameters to compute per token
  • Lower cost per query in production

Google's choice of 128 small experts (vs Llama 4's 16 large experts) appears to work. The LMArena score of 1441 with only 4B active params is remarkable — it's competitive with models 8x its active size.
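The "8x less compute" claim is simple arithmetic on the active-parameter counts from the table, using the standard rough estimate of ~2 FLOPs per active parameter per generated token:

```python
# Back-of-envelope: decode-time FLOPs scale with *active* parameters,
# at roughly 2 FLOPs per active parameter per generated token.
dense_active = 30.7e9   # Gemma 4 31B: every parameter is active
moe_active   = 3.8e9    # Gemma 4 26B-A4B: 8 of 128 experts + shared expert

flops_dense = 2 * dense_active
flops_moe   = 2 * moe_active
print(f"dense: {flops_dense / 1e9:.1f} GFLOPs/token")
print(f"moe:   {flops_moe / 1e9:.1f} GFLOPs/token")
print(f"ratio: {flops_dense / flops_moe:.1f}x")  # ~8.1x
```

This ignores attention FLOPs and memory-bandwidth effects, so real-world speedups will differ, but it explains where the headline ratio comes from.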

How Gemma 4 Reshapes Our Comparison

Our March rankings put Qwen 3.5 first, Llama 4 second, Gemma 3 third. Here's how Gemma 4 changes each category:

| Category | March Winner | Updated Assessment |
|---|---|---|
| General reasoning | Llama 4 Scout | Gemma 4 31B takes the lead (84.3% GPQA vs Scout's 74.3%) |
| Mathematics | Qwen 3.5-27B | Gemma 4 dominates (89.2% AIME, well ahead of Qwen's ~49%) |
| Coding | Qwen 3.5-27B | Gemma 4 dominates (80.0% LiveCodeBench vs Qwen's ~43%) |
| Multilingual | Qwen 3.5-27B | Likely still Qwen (250K vocab, 201 languages vs Gemma's 140) |
| Inference speed | Gemma 3 27B | TBD — need to benchmark Gemma 4 31B on the same hardware |
| Context length | Llama 4 Scout (10M) | Still Llama 4 (10M vs 256K), but Gemma 4 actually uses its context |
| License | Qwen 3.5 (Apache 2.0) | Tie — Gemma 4 is now Apache 2.0 too |
| VRAM efficiency | Qwen 3.5-9B | Gemma 4 26B-A4B is the new efficiency king |

Note on benchmark versions: Our March tests used AIME 2025, LiveCodeBench v5, and standard MMLU. Gemma 4's reported scores use AIME 2026, LiveCodeBench v6, and MMLU Pro. Direct numerical comparison across versions should be taken as directional, not exact. The Gemma 3 → Gemma 4 comparisons above use identical benchmark versions.

The Apache 2.0 Switch

Gemma 3 shipped with the "Gemma Open" license — commercial use allowed but with Google-specific terms and restrictions. In our March comparison, we flagged this as a disadvantage against Qwen 3.5's Apache 2.0.

Gemma 4 switches to Apache 2.0. No usage restrictions, no MAU limits, no acceptable use policies. The same license as Qwen 3.5.

This removes one of the last arguments against Gemma. For businesses building products on open models, the licensing playing field is now level between Gemma 4 and Qwen 3.5. Llama 4's community license (700M MAU limit + Meta's acceptable use policy) is now the most restrictive of the three families.

What's New Beyond Benchmarks

Thinking Mode

Gemma 4 supports extended reasoning — the model produces a chain-of-thought before answering, similar to DeepSeek-R1 or OpenAI o1. This is what drives the massive math and reasoning improvements. The thinking can run to 4,000+ tokens, giving the model space to break problems down, try approaches, and verify its work.
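In practice, applications usually need to separate the chain-of-thought from the final answer before showing anything to users. A minimal sketch, assuming R1-style `<think>...</think>` delimiters — an assumption on our part, since the article doesn't specify Gemma 4's exact output format (check the model card before relying on this):

```python
import re

def split_thinking(raw: str):
    """Split a thinking-mode completion into (reasoning, answer).
    Assumes <think>...</think> delimiters, as used by DeepSeek-R1;
    Gemma 4's actual delimiters may differ."""
    m = re.search(r"<think>(.*?)</think>\s*(.*)", raw, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", raw.strip()  # no thinking block found

raw = "<think>Try casework on the parity...</think> The answer is 204."
thinking, answer = split_thinking(raw)
print(answer)  # The answer is 204.
```

Whatever the delimiter, the pattern is the same: log or discard the reasoning span, return only the answer span, and budget for those 4,000+ extra tokens in latency and cost.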

Multimodal Audio

The smaller models (E2B, E4B) support audio input — speech transcription and audio Q&A. The larger models (26B-A4B, 31B) handle image and video but not audio. This is an unusual split: the edge models are more multimodal than the flagship.

Native Function Calling

All models support structured function calling out of the box — returning JSON with tool calls without special prompting. Combined with the thinking mode, this makes Gemma 4 a strong candidate for agentic workflows where the model needs to reason about which tools to call and in what order.
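Consuming a structured tool call looks roughly like this. The payload shape below (a JSON object with `name` and `arguments`) is the common convention across open models, but it's an assumption here — Gemma 4's exact schema is defined by its chat template, and `get_weather` is a hypothetical tool:

```python
import json

# Hypothetical tool-call payload; the real Gemma 4 schema comes from its
# chat template, but name/arguments objects are the typical shape.
model_output = '{"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}'

# Registry mapping tool names to local implementations (toy example)
TOOLS = {"get_weather": lambda city, unit: f"14 deg {unit} in {city}"}

call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])
print(result)  # 14 deg celsius in Berlin
```

In an agent loop, `result` would be fed back to the model as a tool message, letting the thinking mode plan the next call.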

Per-Layer Embeddings (PLE)

A novel architecture feature: a second embedding table feeds residual signals into every decoder layer, giving each layer a token-identity component tailored to that specific layer's role. This is a quiet innovation that likely contributes to the quality improvements across the board.
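Structurally, the idea is easy to sketch: alongside the usual input embedding table, each decoder layer gets its own token-indexed table whose rows are added to that layer's residual stream. The sketch below is our illustration of that description, with toy dimensions and a stand-in for the attention/MLP block — it is not Gemma 4's actual implementation:

```python
import random

random.seed(0)
VOCAB, D, LAYERS = 1000, 8, 4

# Standard input embedding table, read once at the bottom of the stack
tok_emb = [[random.gauss(0, 0.02) for _ in range(D)] for _ in range(VOCAB)]
# Per-Layer Embeddings: a second table giving each token a separate
# learned vector for *every* decoder layer
ple = [[[random.gauss(0, 0.02) for _ in range(D)] for _ in range(VOCAB)]
       for _ in range(LAYERS)]

def decoder_layer(h, layer_idx, token_id):
    # stand-in for attention + MLP; the point is the PLE residual add
    per_layer = ple[layer_idx][token_id]   # token identity, layer-specific
    return [a + b for a, b in zip(h, per_layer)]

def forward(token_id):
    h = tok_emb[token_id]
    for i in range(LAYERS):
        h = decoder_layer(h, i, token_id)  # every layer re-sees the token
    return h

print(len(forward(42)))  # 8
```

The design point: instead of forcing one embedding to serve all depths, each layer gets a token signal tuned to its own role in the stack.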

Shared KV Cache

The last several decoder layers share key-value tensors, reducing memory usage during long-context inference with minimal quality impact. Combined with the 256K context window, this makes Gemma 4 practical for long-document workflows where Gemma 3 was only theoretical.
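The memory win is easy to quantify. The calculation below uses illustrative placeholder dimensions (layer count, KV heads, head size are our guesses, not published Gemma 4 figures), but the shape of the savings holds for any config:

```python
# KV-cache memory at long context, with and without the last layers
# sharing one key/value set. All dimensions are illustrative placeholders,
# not published Gemma 4 numbers.
def kv_bytes(n_kv_layers, n_kv_heads=8, head_dim=128, ctx=256_000, dtype_bytes=2):
    return 2 * n_kv_layers * n_kv_heads * head_dim * ctx * dtype_bytes  # 2x for K and V

total_layers, shared_tail = 48, 8            # last 8 layers reuse a single KV set
baseline = kv_bytes(total_layers)            # every layer stores its own KV
shared   = kv_bytes(total_layers - shared_tail + 1)
print(f"baseline: {baseline / 2**30:.1f} GiB")
print(f"shared:   {shared / 2**30:.1f} GiB")
```

With these toy numbers, sharing the tail layers trims the 256K-context KV cache by about 15%, and the saving compounds with the MoE variant's already-low active compute.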

Updated Decision Matrix

| If you need... | Use | Why |
|---|---|---|
| Best overall quality (32 GB GPU) | Gemma 4 31B | Leads reasoning, math, coding, vision |
| Best quality per compute | Gemma 4 26B-A4B | 97% of 31B quality at 8x less compute |
| Maximum context window | Llama 4 Scout | Still 10M tokens, unmatched |
| Best multilingual | Qwen 3.5-27B | 250K vocab, 201 languages |
| Best under 10 GB VRAM | Gemma 4 E4B or Qwen 3.5-9B | Both strong; benchmark head-to-head needed |
| Edge / mobile deployment | Gemma 4 E2B | 2.3B active, audio support, 128K context |
| Most permissive license | Gemma 4 or Qwen 3.5 | Both Apache 2.0 |
| Audio understanding | Gemma 4 E4B | Only open model family with native audio |
| Agentic workflows | Gemma 4 31B | Thinking mode + native function calling |

What We Still Need to Test

We haven't run Gemma 4 on our RTX 5090 benchmark suite yet. Key unknowns:

  • Actual inference speed — the 31B dense model should be comparable to Gemma 3 27B in tok/s, but the MoE 26B-A4B is the interesting question. With 128 experts and 3.8B active params, it could be very fast
  • VRAM usage with quantization — Q6_K and Q4_K_M sizes for each variant
  • Real-world multilingual performance — Gemma claims 140 languages, but Qwen's 201-language, 250K-vocabulary advantage may still hold for CJK and non-Latin scripts
  • Thinking mode overhead — how much slower is inference when the model reasons for 4,000 tokens before answering?
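For the VRAM question, file sizes can at least be estimated ahead of time from parameter counts. The effective bits-per-weight figures below are approximate values for llama.cpp's mixed-precision quant types and should be treated as ballpark, not exact:

```python
# Rough GGUF file-size estimate: total params x effective bits-per-weight / 8.
# Bits/weight for llama.cpp quants are approximate (mixed-precision layers).
BITS = {"Q4_K_M": 4.85, "Q6_K": 6.56}

def gguf_gib(total_params, quant):
    return total_params * BITS[quant] / 8 / 2**30

for name, params in [("Gemma 4 26B-A4B", 25.2e9), ("Gemma 4 31B", 30.7e9)]:
    for q in ("Q4_K_M", "Q6_K"):
        print(f"{name} {q}: ~{gguf_gib(params, q):.1f} GiB")
```

Note this is file size only; runtime VRAM adds the KV cache, which at 256K context can rival the weights themselves.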

We'll publish a full hands-on benchmark when we've run the tests. For now, Google's reported numbers are strong enough to change the recommendation.

The Bottom Line

A month ago we wrote: "Choose Gemma 3 when you need Google ecosystem integration or when marginal inference speed differences matter. It's a solid model, but it doesn't lead in any benchmark category."

That's no longer true. Gemma 4 leads in reasoning, math, coding, and vision. The 26B-A4B MoE variant offers the best quality-per-compute ratio in the open model space. The license is now Apache 2.0. The context window works.

The open model race just got a new leader. Qwen 3.5 still holds the multilingual crown, and Llama 4 Scout still has the unmatched 10M context window. But for overall quality, especially on hard reasoning and coding tasks, Gemma 4 is the model to beat.

The ball is now in Alibaba's and Meta's court.


This article is a follow-up to Llama 4 vs Qwen 3.5 vs Gemma 3: Which Open Model Should You Deploy?, published March 6, 2026.
