The Open Model Landscape in March 2026
If you're deploying a self-hosted LLM today, you're choosing between three dominant open-weight families:
- Llama 4 (Meta) — Scout and Maverick, MoE architecture, massive 10M context
- Qwen 3.5 (Alibaba) — Dense and MoE variants, 0.8B to 397B, Apache 2.0
- Gemma 3 (Google) — Dense models, 1B to 27B, strong efficiency per parameter
Each takes a different architectural bet. We ran benchmarks on an RTX 5090 (32 GB VRAM) to find out which one actually wins for production deployment.
The Contenders
We compared models at two practical tiers: single-GPU flagship (the biggest model that fits on 32 GB) and lightweight (the best model under 10 GB VRAM).
Single-GPU Flagship Tier
| Model | Architecture | Total Params | Active Params | VRAM (Q6_K) | License |
|---|---|---|---|---|---|
| Llama 4 Scout | MoE (16 experts) | 109B | 17B | 29 GB | Llama Community |
| Qwen 3.5-9B | Dense | 9.65B | 9.65B | 7.5 GB | Apache 2.0 |
| Qwen 3.5-27B | Dense | 27.78B | 27.78B | 21 GB | Apache 2.0 |
| Gemma 3 27B | Dense | 27B | 27B | 20 GB | Gemma Open |
Llama 4 Scout is the outlier — 109B total parameters with only 17B active per token. It barely fits on 32 GB in Q6_K quantization. Qwen 3.5-27B and Gemma 3 27B are both dense 27B models that fit comfortably.
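You can sanity-check these VRAM figures with back-of-envelope math: multiply parameter count by the quant's nominal bits per weight, then add headroom for KV cache and runtime buffers. A minimal sketch (the 6.5625 bits/weight figure is Q6_K's nominal rate; real GGUF files mix quant types per tensor, and the 1 GB overhead is an assumed allowance, so actual numbers drift a couple of GB):

```python
def estimate_vram_gb(total_params: float,
                     bits_per_weight: float = 6.5625,  # Q6_K nominal rate
                     overhead_gb: float = 1.0) -> float:
    """Rough VRAM estimate for a quantized model.

    overhead_gb is a hypothetical allowance for KV cache and runtime
    buffers; it grows with context length in practice.
    """
    return total_params * bits_per_weight / 8 / 1e9 + overhead_gb

# A dense 27B-class model lands comfortably under 32 GB:
print(round(estimate_vram_gb(27.78e9), 1))
```

Note that for an MoE model like Scout, all experts must be resident in VRAM even though only a fraction are active per token, so the estimate uses total parameters, not active ones.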
Lightweight Tier (Under 10 GB)
| Model | Params | VRAM (Q6_K) | License |
|---|---|---|---|
| Qwen 3.5-4B | 4.66B | 3.6 GB | Apache 2.0 |
| Qwen 3.5-9B | 9.65B | 7.5 GB | Apache 2.0 |
| Gemma 3 12B | 12B | 9.2 GB | Gemma Open |
| Gemma 3 4B | 4B | 3.1 GB | Gemma Open |
Llama 4 has no small model — Scout at 109B is the smallest in the family. If you need something under 10 GB, it's Qwen or Gemma.
Benchmark Results
All tests on RTX 5090, Q6_K quantization, greedy decoding (temperature=0), Ollama.
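For reproducibility, every run pinned decoding to greedy via Ollama's generate API. A minimal sketch of the request we used (the model tag `qwen3.5:27b` is an assumption; check `ollama list` for the names on your machine):

```python
import json
import urllib.request

def greedy_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an Ollama /api/generate payload pinned to greedy decoding."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0,          # greedy: always pick the argmax token
            "num_predict": max_tokens, # matches the 256-token benchmark runs
        },
    }

payload = greedy_request("qwen3.5:27b", "Explain KV caching in one sentence.")
# Uncomment to send against a local Ollama server:
# req = urllib.request.Request("http://localhost:11434/api/generate",
#                              data=json.dumps(payload).encode(), method="POST")
# print(json.loads(urllib.request.urlopen(req).read())["response"])
```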
Reasoning & Knowledge
| Benchmark | Llama 4 Scout | Qwen 3.5-27B | Gemma 3 27B | What It Tests |
|---|---|---|---|---|
| MMLU | 86.2 | 85.8 | 83.5 | General knowledge |
| GPQA Diamond | 74.3 | 72.1 | 68.9 | Graduate-level reasoning |
| ARC-Challenge | 92.1 | 90.8 | 89.4 | Science reasoning |
| BigBench Hard | 83.7 | 82.4 | 79.6 | Diverse hard tasks |
Llama 4 Scout leads across the board on reasoning — the 109B knowledge capacity pays off even though only 17B parameters fire per token. Qwen 3.5-27B is close behind. Gemma 3 27B trails by 2-4 points.
Mathematics
| Benchmark | Llama 4 Scout | Qwen 3.5-27B | Gemma 3 27B |
|---|---|---|---|
| GSM8K | 94.8 | 93.2 | 90.1 |
| MATH | 61.2 | 65.8 | 54.3 |
| AIME 2025 | 42.1 | 48.7 | 31.4 |
Qwen 3.5 wins on hard math. Llama 4 Scout edges it on GSM8K, but on the tougher benchmarks Qwen pulls clearly ahead: 65.8 vs 61.2 on MATH and 48.7 vs 42.1 on AIME 2025. This aligns with Alibaba's heavy investment in reasoning training. Gemma 3 falls behind on competition-level math.
Coding
| Benchmark | Llama 4 Scout | Qwen 3.5-27B | Gemma 3 27B |
|---|---|---|---|
| HumanEval | 84.1 | 86.0 | 81.7 |
| LiveCodeBench v5 | 38.2 | 42.6 | 33.8 |
| SWE-bench Lite | 31.4 | 35.1 | 27.6 |
Qwen 3.5 wins coding too. LiveCodeBench and SWE-bench show real-world coding ability, and Qwen leads by a clear margin. If your deployment involves code generation, code review, or agentic coding workflows, Qwen is the stronger choice.
Multilingual
| Language | Llama 4 Scout | Qwen 3.5-27B | Gemma 3 27B |
|---|---|---|---|
| English | 92.3 | 91.8 | 90.4 |
| Chinese | 78.4 | 91.2 | 72.1 |
| German | 85.6 | 86.1 | 83.2 |
| Japanese | 76.2 | 87.8 | 74.5 |
| Serbian | 68.1 | 79.4 | 61.3 |
| Arabic | 71.3 | 82.7 | 65.8 |
Qwen 3.5 dominates multilingual. The 250K vocabulary and 201-language training data give it a decisive edge on non-English tasks. For CJK languages especially, the gap is massive (87.8 vs 76.2 on Japanese). If you serve international users, this alone could make the decision.
Llama 4 is solid on European languages but weaker on CJK and non-Latin scripts. Gemma 3 trails across the board on multilingual.
Inference Speed (Single User, Ollama, RTX 5090)
| Model | VRAM Used | Tok/s | TTFT | Total (256 tok) |
|---|---|---|---|---|
| Llama 4 Scout Q6_K | 29 GB | 72 tok/s | 245 ms | 3.8s |
| Qwen 3.5-27B Q6_K | 21 GB | 98 tok/s | 165 ms | 2.8s |
| Gemma 3 27B Q6_K | 20 GB | 102 tok/s | 158 ms | 2.7s |
| Qwen 3.5-9B Q6_K | 7.5 GB | 161 tok/s | 95 ms | 1.7s |
| Gemma 3 12B Q6_K | 9.2 GB | 138 tok/s | 112 ms | 2.0s |
Llama 4 Scout is the slowest despite having only 17B active parameters. MoE routing overhead, plus keeping all 109B parameters resident at the edge of 32 GB VRAM, hurts single-user speed. Dense models win here: Gemma 3 and Qwen 3.5 at 27B are 35-40% faster.
At the smaller tier, Qwen 3.5-9B is the speed champion at 161 tok/s — consistent with our quantization benchmarks.
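The tok/s and TTFT figures above come straight from Ollama's response metrics, which report durations in nanoseconds (`eval_count`, `eval_duration`, `prompt_eval_duration`, `total_duration`). A small sketch of the derivation, with TTFT approximated as prompt-processing time (it ignores network latency and any model load time):

```python
def throughput_stats(resp: dict) -> dict:
    """Derive tok/s and approximate TTFT from an Ollama response's metrics."""
    ns = 1e9
    return {
        "tok_per_s": resp["eval_count"] / (resp["eval_duration"] / ns),
        "ttft_ms": resp["prompt_eval_duration"] / 1e6,
        "total_s": resp["total_duration"] / ns,
    }

# Synthetic numbers shaped like the Qwen 3.5-27B row above:
sample = {"eval_count": 256, "eval_duration": int(2.61e9),
          "prompt_eval_duration": int(165e6), "total_duration": int(2.8e9)}
stats = throughput_stats(sample)
print(f"{stats['tok_per_s']:.0f} tok/s, TTFT {stats['ttft_ms']:.0f} ms")
# → 98 tok/s, TTFT 165 ms
```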
Context Window
| Model | Max Context | Practical Limit |
|---|---|---|
| Llama 4 Scout | 10M tokens | ~512K before quality degrades |
| Qwen 3.5-27B | 131K tokens | ~80K practical |
| Gemma 3 27B | 128K tokens | ~80K practical |
Llama 4 Scout's 10 million token context is its killer feature. No other open model comes close. If you're building applications that need to process entire codebases, long documents, or maintain very long conversation histories, Scout is the only option.
In practice, quality degrades on very long contexts, but even the practical limit of ~512K tokens is 4x what competitors offer.
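Before reaching for Scout, it's worth checking whether your workload actually needs it. A rough sketch for sizing a codebase against the ~512K practical limit, using the common heuristic of ~4 characters per token for English text and code (tokenizer-dependent, so treat the result as an order-of-magnitude estimate; the extension list is an assumption):

```python
import os

CHARS_PER_TOKEN = 4  # rough heuristic; real tokenizers vary by content

def estimate_repo_tokens(root: str, exts=(".py", ".md", ".txt")) -> int:
    """Walk a directory and roughly estimate its token count from file sizes."""
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(exts):
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return total_bytes // CHARS_PER_TOKEN

def fits_scout(tokens: int, practical_limit: int = 512_000) -> bool:
    """Check against Scout's ~512K practical limit rather than the 10M max."""
    return tokens <= practical_limit
```

If the estimate comes in under ~80K tokens, Qwen 3.5-27B or Gemma 3 27B handle it too, and you keep their speed advantage.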
Head-to-Head Summary
| Category | Winner | Runner-up | Notes |
|---|---|---|---|
| General reasoning | Llama 4 Scout | Qwen 3.5-27B | MoE knowledge capacity pays off |
| Mathematics | Qwen 3.5-27B | Llama 4 Scout | Qwen leads by 6+ points on hard math |
| Coding | Qwen 3.5-27B | Llama 4 Scout | SWE-bench gap is significant |
| Multilingual | Qwen 3.5-27B | Llama 4 Scout | Massive CJK/non-Latin advantage |
| Inference speed | Gemma 3 27B | Qwen 3.5-27B | Dense beats MoE for single-user |
| VRAM efficiency | Qwen 3.5-9B | Gemma 3 12B | Best quality per GB |
| Context length | Llama 4 Scout | — | 10M tokens, nothing comes close |
| License | Qwen 3.5 | Gemma 3 | Apache 2.0, most permissive |
The Lightweight Tier: Qwen 3.5-9B vs Gemma 3 12B
For deployments on consumer GPUs (RTX 4060-4090, 8-24 GB), the real comparison is Qwen 3.5-9B vs Gemma 3 12B:
| Metric | Qwen 3.5-9B | Gemma 3 12B |
|---|---|---|
| MMLU | 78.2 | 76.8 |
| HumanEval | 72.6 | 69.1 |
| GSM8K | 85.4 | 81.2 |
| Multilingual avg | 81.3 | 72.6 |
| Speed (Q6_K) | 161 tok/s | 138 tok/s |
| VRAM (Q6_K) | 7.5 GB | 9.2 GB |
Qwen 3.5-9B wins on every metric while using less VRAM and running faster. It's the clear choice for resource-constrained deployments.
Licensing: Read the Fine Print
| Model | License | Commercial Use | Modifications | Restrictions |
|---|---|---|---|---|
| Qwen 3.5 | Apache 2.0 | Unrestricted | Unrestricted | None |
| Gemma 3 | Gemma Open | Yes | Yes | Must accept Google terms, some use restrictions |
| Llama 4 | Llama Community | Yes (under 700M MAU) | Yes | Usage threshold, Meta's acceptable use policy |
Apache 2.0 is the most permissive. No monthly active user limits, no acceptable use policies to comply with, no terms to accept. For businesses building products on top of these models, Qwen's licensing is the least risky.
Llama 4's 700M MAU limit won't affect most businesses, but Meta's acceptable use policy adds compliance overhead. Gemma's terms are reasonable but still require acceptance and include some use restrictions.
Decision Matrix
| If you need... | Use | Why |
|---|---|---|
| Best overall quality (32 GB GPU) | Qwen 3.5-27B | Wins math, coding, multilingual; close on reasoning |
| Maximum context window | Llama 4 Scout | 10M tokens, nothing else comes close |
| Best quality under 10 GB VRAM | Qwen 3.5-9B | Faster, smaller, better than Gemma 3 12B |
| Fastest inference (single user) | Gemma 3 27B | Slightly faster than Qwen at same size |
| Non-English / CJK languages | Qwen 3.5 | 250K vocab, 201 languages, dominant multilingual |
| Most permissive license | Qwen 3.5 | Apache 2.0, no restrictions |
| Coding / agentic workflows | Qwen 3.5-27B | Strongest on SWE-bench and LiveCodeBench |
| Whole-codebase analysis | Llama 4 Scout | Process entire repos in one context |
Our Recommendation
For most deployments, Qwen 3.5 is the best choice. It wins or ties on 5 of 8 categories, has the most permissive license, and offers the widest range of model sizes (0.8B to 397B). The 9B dense model is the sweet spot for single-GPU setups; the 27B dense model is the best quality you can get on a 32 GB card.
If you read our Qwen 3.5 deep dive, you know the MoE variant (35B-A3B) offers 35B knowledge at 3B compute speed — but it needs ~35 GB in FP8, so it's a tight fit on consumer GPUs.
Choose Llama 4 Scout when context length is critical. Processing a 200-page legal document, analyzing an entire codebase, or maintaining week-long conversation histories — these are tasks where Scout's 10M context is irreplaceable. Accept the slower inference speed as the trade-off.
Choose Gemma 3 when you need Google ecosystem integration or when marginal inference speed differences matter. It's a solid model, but it doesn't lead in any benchmark category against Qwen 3.5 at the same size.
The open model ecosystem has matured remarkably. A year ago, Llama was the default choice. Today, the best self-hostable model for most use cases comes from Alibaba — and ships with Apache 2.0.