Alibaba released Qwen 3.5 between February 16 and March 2, 2026 — eight models spanning 0.8B to 397B parameters, all Apache 2.0 licensed. Alibaba claims the flagship beats GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro across 80% of benchmark categories.
But benchmarks are benchmarks. What matters for deployment: how much VRAM do you actually need, and is the Mixture of Experts architecture worth the memory trade-off?
The Full Lineup
Qwen 3.5 ships in two flavors: dense models where every parameter fires on every token, and MoE (Mixture of Experts) models where a router selects a subset of parameters per token.
Dense Models
| Model | Parameters | BF16 Memory | FP8 Memory |
|---|---|---|---|
| Qwen3.5-0.8B | 873M | 1.63 GB | — |
| Qwen3.5-2B | 2.27B | 4.24 GB | — |
| Qwen3.5-4B | 4.66B | 8.68 GB | — |
| Qwen3.5-9B | 9.65B | 17.98 GB | — |
| Qwen3.5-27B | 27.78B | 51.75 GB | 28.75 GB |
The small models (0.8B through 9B) are BF16-only — no FP8 variants published. The 27B model gets an FP8 option that nearly halves the memory footprint.
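The BF16 column follows directly from the parameter counts: 2 bytes per parameter, reported in GiB. A quick sanity check (the FP8 figures sit slightly above 1 byte per parameter, presumably because some layers stay in higher precision; that reading is my assumption, not something the release notes state):

```python
def bf16_weight_gib(params: float) -> float:
    """Weight-only memory in GiB: 2 bytes per parameter, divided by 2^30."""
    return params * 2 / 2**30

# Reproduces the BF16 column above, to within parameter-count rounding
for name, params in [("Qwen3.5-0.8B", 0.873e9), ("Qwen3.5-27B", 27.78e9)]:
    print(f"{name}: {bf16_weight_gib(params):.2f} GiB")
```

Note this is weights only: KV cache and activations come on top, which is why the deployment guidance later leaves headroom.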
MoE Models
The naming convention tells you everything: 35B-A3B means 35B total parameters, 3B active per token.
| Model | Total Params | Active Params | BF16 Memory | FP8 Memory |
|---|---|---|---|---|
| Qwen3.5-35B-A3B | 35.95B | ~3B | 66.97 GB | 34.88 GB |
| Qwen3.5-122B-A10B | 125.09B | ~10B | 232.99 GB | 118.42 GB |
| Qwen3.5-397B-A17B | 403.40B | ~17B | 751.39 GB | 378.23 GB |
The 397B flagship needs 378 GB in FP8 — at least five A100-80GB GPUs just to hold the weights, before KV cache and activations. The 35B MoE model is the most practical of the three: it fits in under 35 GB (FP8) on a single high-end GPU while delivering inference speed comparable to a 4B dense model.
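That GPU count is just a ceiling division over the weight footprint; a minimal sketch (weights only, so treat the result as a hard floor, not a serving plan):

```python
import math

def min_gpus(weights_gib: float, gpu_gib: float = 80.0) -> int:
    """Lower bound on GPU count just to hold the weights (no KV cache or activations)."""
    return math.ceil(weights_gib / gpu_gib)

# Qwen3.5-397B-A17B in FP8 across A100-80GB cards
print(min_gpus(378.23))  # 5
```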
How Mixture of Experts Works
In a standard dense transformer, every parameter participates in every forward pass. A 27B dense model activates all 27B parameters for each token — that's the compute cost you pay.
MoE models split their feed-forward layers into multiple independent "expert" sub-networks. A lightweight router selects only a few experts per token. Most parameters stay idle during any given forward pass.
```
                 ┌──────────┐
        ┌───────>│ Expert 1 │───────┐
        │        └──────────┘       │
Input ──> Router                    ├──> Output
        │        ┌──────────┐       │
        └───────>│ Expert 3 │───────┘
                 └──────────┘
         (Experts 2, 4 ... N idle)
```
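In code, the routing step above is just score-and-select; a toy NumPy sketch (dimensions, top-k of 2, and random weights are illustrative, not Qwen's actual layer shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 8, 4, 2

# Each "expert" is a tiny feed-forward layer (here just a weight matrix)
expert_ws = [rng.standard_normal((d, d)) for _ in range(n_experts)]
gate_w = rng.standard_normal((d, n_experts))   # lightweight router

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ gate_w                        # one router score per expert
    top = np.argsort(logits)[-k:]              # indices of the k best-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                               # softmax over the chosen k only
    # Only k of n_experts matrices are touched; the rest stay idle for this token
    return sum(wi * (x @ expert_ws[i]) for wi, i in zip(w, top))

y = moe_forward(rng.standard_normal(d))
print(y.shape)  # (8,)
```

The key property is visible in the last line of `moe_forward`: compute scales with `k` experts, while memory scales with all `n_experts`.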
The Trade-off in One Table
| Model | Active Compute | Knowledge Capacity | VRAM Needed |
|---|---|---|---|
| Qwen3.5-4B (dense) | 4.66B | 4.66B | 8.68 GB |
| Qwen3.5-35B-A3B (MoE) | ~3B | 35.95B | 66.97 GB |
Both activate roughly the same number of parameters per token (~3-4B), so inference speed is similar. But the MoE model carries 35B total parameters of learned knowledge versus only 4B — you get 4B-speed inference with 35B-quality answers.
The catch: all 35B parameters must sit in VRAM even though only 3B fire per token. MoE is essentially a trade: spend spare VRAM to get better answers without slowing inference down.
If you don't have the VRAM, a dense model that actually fits will beat a MoE model you can't load.
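The "similar speed" claim rests on a rough rule of thumb: decoding costs about 2 FLOPs per active parameter per token. A quick sketch comparing the two rows of the table:

```python
def flops_per_token(active_params: float) -> float:
    """Rough decode cost: ~2 FLOPs per active parameter per token (rule of thumb)."""
    return 2 * active_params

# Qwen3.5-35B-A3B (~3B active) vs Qwen3.5-4B dense (4.66B active)
ratio = flops_per_token(3e9) / flops_per_token(4.66e9)
print(f"MoE/dense compute ratio: {ratio:.2f}")  # 0.64
```

By this estimate the MoE model actually does slightly less compute per token than the 4B dense model, despite carrying 7-8x the parameters.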
When to Use Which
| Scenario | Better Choice |
|---|---|
| Limited VRAM, need quality | Dense model that fits (e.g., 9B dense in 18 GB) |
| Enough VRAM, want best quality/speed | MoE (e.g., 35B-A3B: 3B compute, 35B knowledge) |
| Serving many concurrent users | MoE — high throughput at lower compute per request |
| Single-user, small batch | Dense model is simpler and equally fast |
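The decision table collapses into a small picker over the memory figures listed earlier; a sketch (the 10% headroom factor is my assumption, and real deployments need additional room for KV cache at long context):

```python
def pick_qwen35(vram_gib: float) -> str:
    """Largest Qwen 3.5 variant whose weights fit the given VRAM, from the tables above."""
    options = [  # (weight footprint in GiB, model)
        (1.63,  "Qwen3.5-0.8B (BF16)"),
        (4.24,  "Qwen3.5-2B (BF16)"),
        (8.68,  "Qwen3.5-4B (BF16)"),
        (17.98, "Qwen3.5-9B (BF16)"),
        (28.75, "Qwen3.5-27B (FP8)"),
        (34.88, "Qwen3.5-35B-A3B (FP8)"),
    ]
    fits = [m for gib, m in options if gib <= vram_gib * 0.9]  # ~10% headroom
    return fits[-1] if fits else "nothing fits; use a hosted API"

print(pick_qwen35(24))  # RTX 4090 -> Qwen3.5-9B (BF16)
print(pick_qwen35(48))  # A6000    -> Qwen3.5-35B-A3B (FP8)
```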
What's New in 3.5 vs Qwen 3
The architecture changes that matter:
- Expanded vocabulary — 250K tokens (up from 152K in Qwen 3). This means 10-60% fewer tokens for multilingual text, directly translating to lower inference cost and faster responses.
- Native multimodal training — Vision and language trained together from the start ("early fusion"), not bolted on later. Processes images up to 1344x1344 and video at 8 FPS.
- Hybrid attention with Delta Networks — Gated Delta Networks combined with sparse MoE for more efficient inference. The practical result: 8.6x faster decoding at 32K context, up to 19x at 256K context versus Qwen 3.
- 201 languages — Up from the already broad multilingual support in Qwen 3.
- Reinforcement learning at scale — Trained across "million-agent environments" with progressively complex tasks, specifically targeting agentic use cases (tool calling, multi-step workflows, code execution).
Benchmark Results
The 397B flagship hits strong numbers:
| Benchmark | Qwen3.5-397B | What It Tests |
|---|---|---|
| GPQA Diamond | 88.4 | Graduate-level reasoning |
| AIME 2026 | 91.3 | Olympiad mathematics |
| LiveCodeBench v6 | 83.6 | Competitive programming |
| SWE-bench Verified | 76.4 | Real-world software engineering |
| IFEval | 92.6 | Instruction following |
| MMLU | 88.5 | General knowledge |
| MathVision | 90.8 | Mathematical visual reasoning |
| MMMU | 85.0 | Multimodal understanding |
The GPQA Diamond score of 88.4 is the highest of any open-source model. The SWE-bench Verified score of 76.4 shows competitive real-world coding ability — for reference, Claude Opus 4.6 scores above 80%.
On the hosted API side, Qwen 3.5-Plus (the proprietary variant) runs at ~$0.18 per million tokens, making it one of the cheapest frontier-tier options.
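At that rate, back-of-the-envelope costing is simple; a sketch (this assumes a flat blended rate, since the text doesn't split input vs. output pricing):

```python
PRICE_PER_MTOK = 0.18  # USD per million tokens, Qwen 3.5-Plus (figure from the text)

def monthly_cost(requests_per_day: int, tokens_per_request: int) -> float:
    """30-day token bill at a flat per-token rate."""
    tokens = requests_per_day * tokens_per_request * 30
    return tokens / 1e6 * PRICE_PER_MTOK

# e.g. a support bot: 10k requests/day at ~2k tokens each
print(f"${monthly_cost(10_000, 2_000):.2f}/month")  # $108.00/month
```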
The Competition: March 2026
Qwen 3.5 is too new for Chatbot Arena ELO ratings, but the open-source leaderboard tells a clear story about who's competing:
| Rank | Model | Organization | ELO |
|---|---|---|---|
| 1 | GLM-5 | Zhipu AI | 1451 |
| 2 | Kimi K2.5 | Moonshot AI | 1447 |
| 3 | GLM-4.7 | Zhipu AI | 1445 |
| 4 | Qwen 3 235B | Alibaba | 1422 |
| 5 | DeepSeek V3.2 | DeepSeek | 1421 |
| 6 | Mistral Large | Mistral | 1416 |
| 7 | DeepSeek R1 | DeepSeek | 1398 |
Who's the Real Threat
GLM-5 / GLM-4.7 (Zhipu AI) currently sit at #1 and #3 by human preference. These are the models to beat. GLM-5 in particular has been remarkably consistent across diverse tasks.
Kimi K2.5 (Moonshot AI) is right on GLM-5's heels — a strong all-rounder that doesn't dominate any single benchmark but rarely fails either.
DeepSeek V3.2 / R1 — R1 dominates long-chain reasoning and math. V3.2 is the more practical general-purpose model. Together they cover a lot of ground.
Step-3.5-Flash (StepFun) deserves a mention: only 196B parameters but scores 97.3 on AIME 2025, the highest math score on the board. Proves that raw parameter count isn't everything.
The Pattern
The open-source LLM race is heavily dominated by Chinese labs — Alibaba, Zhipu, Moonshot, DeepSeek, StepFun. The main non-Chinese competitors are Mistral (France) and Google Gemma. Meta's Llama, once the default open-source choice, hasn't kept pace at the top of the leaderboard.
Practical Takeaway
To deploy Qwen 3.5 today, pick the model whose memory footprint fits your GPU:
- Under 10 GB VRAM — Qwen3.5-4B dense (8.68 GB BF16) or Qwen3.5-2B for lighter workloads
- 24 GB VRAM (RTX 4090) — Qwen3.5-9B dense (17.98 GB) is the sweet spot. Fast, capable, fits with room for context
- 32 GB VRAM (RTX 5090) — Qwen3.5-9B dense with plenty of headroom for long context, or Qwen3.5-27B in FP8 (28.75 GB) if you want to push quality higher
- 48 GB VRAM (A6000, dual consumer GPUs) — Qwen3.5-35B-A3B in FP8 (34.88 GB). MoE gives you 35B knowledge at 3B speed
- Multi-GPU server — Qwen3.5-122B-A10B or the 397B flagship, depending on how many GPUs you can throw at it
For most business deployments — product assistants, customer support, content generation — the 9B dense or 35B MoE models hit the practical sweet spot. The 397B flagship is impressive on benchmarks but requires serious infrastructure.
The broader trend: open-source models are closing the gap with proprietary ones fast. Qwen 3.5's benchmark numbers put it within striking distance of GPT-5.2 and Claude Opus 4.5, and it ships with Apache 2.0. For businesses that care about data privacy, cost control, and customization, that matters more than who's #1 on any given leaderboard.