
Qwen 3.5: 35B Knowledge at 4B Speed — Better Than GPT-5?

ai.rs Mar 4, 2026

Alibaba released Qwen 3.5 between February 16 and March 2, 2026 — eight models spanning 0.8B to 397B parameters, all Apache 2.0 licensed. The flagship model claims to beat GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro across 80% of benchmark categories.

But benchmarks are benchmarks. What matters for deployment: how much VRAM do you actually need, and is the Mixture of Experts architecture worth the memory trade-off?


The Full Lineup

Qwen 3.5 ships in two flavors: dense models where every parameter fires on every token, and MoE (Mixture of Experts) models where a router selects a subset of parameters per token.

Dense Models

Model          Parameters   BF16 Memory   FP8 Memory
Qwen3.5-0.8B   873M         1.63 GB       —
Qwen3.5-2B     2.27B        4.24 GB       —
Qwen3.5-4B     4.66B        8.68 GB       —
Qwen3.5-9B     9.65B        17.98 GB      —
Qwen3.5-27B    27.78B       51.75 GB      28.75 GB

The small models (0.8B through 9B) are BF16-only — no FP8 variants published. The 27B model gets an FP8 option that nearly halves the memory footprint.
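These figures follow from simple arithmetic: weights take parameter count times bytes per parameter (2 for BF16), so a quick sanity check is easy to script. A minimal sketch; note the listed FP8 sizes run slightly above a flat 1 byte per parameter, presumably because some layers stay in higher precision:

```python
def model_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Raw weight memory in GB (1 GB = 2**30 bytes); excludes KV cache."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# Qwen3.5-27B in BF16: 27.78B params at 2 bytes each
print(round(model_memory_gb(27.78, 2), 2))  # close to the 51.75 GB listed
```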

MoE Models

The naming convention tells you everything: 35B-A3B means 35B total parameters, 3B active per token.

Model               Total Params   Active Params   BF16 Memory   FP8 Memory
Qwen3.5-35B-A3B     35.95B         ~3B             66.97 GB      34.88 GB
Qwen3.5-122B-A10B   125.09B        ~10B            232.99 GB     118.42 GB
Qwen3.5-397B-A17B   403.40B        ~17B            751.39 GB     378.23 GB

The 397B flagship needs 378 GB in FP8 — that's five A100-80GB GPUs at minimum. The 35B MoE model is the most practical: it fits in 35 GB (FP8) on a single high-end GPU while delivering inference speed comparable to a 4B dense model.
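The five-GPU figure is just a ceiling division of weight memory by per-card capacity; anything beyond that is headroom for KV cache and activations. A sketch, assuming 80 GB cards:

```python
import math

def min_gpus(weight_gb: float, gpu_gb: float = 80.0) -> int:
    """Minimum GPU count just to hold the weights (no KV-cache headroom)."""
    return math.ceil(weight_gb / gpu_gb)

print(min_gpus(378.23))  # 397B flagship in FP8 -> 5 x A100-80GB
```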

How Mixture of Experts Works

In a standard dense transformer, every parameter participates in every forward pass. A 27B dense model activates all 27B parameters for each token — that's the compute cost you pay.

MoE models split their feed-forward layers into multiple independent "expert" sub-networks. A lightweight router selects only a few experts per token. Most parameters stay idle during any given forward pass.

                    ┌──────────┐
          ┌────────>│ Expert 1 │─────────┐
          │         └──────────┘         │
Input ──> Router                     ──> Output
          │         ┌──────────┐         │
          └────────>│ Expert 3 │─────────┘
                    └──────────┘
            (Experts 2, 4...N idle)
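The diagram above can be sketched in a few lines of NumPy: a linear router scores all experts, only the top-k run, and their outputs are blended with softmax weights. Expert count, k, and the toy linear experts here are illustrative, not Qwen's actual configuration:

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Route one token through the top-k of N expert networks."""
    scores = x @ router_w                 # (num_experts,) router logits
    top = np.argsort(scores)[-k:]         # indices of the k best experts
    weights = np.exp(scores[top])
    weights /= weights.sum()              # softmax over the selected k only
    # Only the chosen experts run; the rest stay idle for this token.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, num_experts = 16, 8
x = rng.standard_normal(d)
router_w = rng.standard_normal((d, num_experts))
experts = [lambda v, W=rng.standard_normal((d, d)): v @ W
           for _ in range(num_experts)]

y = moe_forward(x, router_w, experts)
print(y.shape)  # (16,)
```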

The Trade-off in One Table

Model                   Active Compute   Knowledge Capacity   VRAM Needed
Qwen3.5-4B (dense)      4.66B            4.66B                8.68 GB
Qwen3.5-35B-A3B (MoE)   ~3B              35.95B               66.97 GB

Both activate roughly the same number of parameters per token (~3-4B), so inference speed is similar. But the MoE model carries 35B total parameters of learned knowledge versus only 4B — you get 4B-speed inference with 35B-quality answers.

The catch: all 35B parameters must sit in VRAM even though only 3B fire per token. MoE is essentially "I have the VRAM to spare, give me better answers without slowing down inference."

If you don't have the VRAM, a dense model that actually fits will beat a MoE model you can't load.

When to Use Which

Scenario                               Better Choice
Limited VRAM, need quality             Dense model that fits (e.g., 9B dense in 18 GB)
Enough VRAM, want best quality/speed   MoE (e.g., 35B-A3B: 3B compute, 35B knowledge)
Serving many concurrent users          MoE — high throughput at lower compute per request
Single-user, small batch               Dense model — simpler and equally fast

What's New in 3.5 vs Qwen 3

The architecture changes that matter:

  1. Expanded vocabulary — 250K tokens (up from 152K in Qwen 3). This means 10-60% fewer tokens for multilingual text, directly translating to lower inference cost and faster responses.

  2. Native multimodal training — Vision and language trained together from the start ("early fusion"), not bolted on later. Processes images up to 1344x1344 and video at 8 FPS.

  3. Hybrid attention with Delta Networks — Gated Delta Networks combined with sparse MoE for more efficient inference. The practical result: 8.6x faster decoding at 32K context, up to 19x at 256K context versus Qwen 3.

  4. 201 languages — Up from the already broad multilingual support in Qwen 3.

  5. Reinforcement learning at scale — Trained across "million-agent environments" with progressively complex tasks, specifically targeting agentic use cases (tool calling, multi-step workflows, code execution).

Benchmark Results

The 397B flagship hits strong numbers:

Benchmark            Qwen3.5-397B   What It Tests
GPQA Diamond         88.4           Graduate-level reasoning
AIME 2026            91.3           Olympiad mathematics
LiveCodeBench v6     83.6           Competitive programming
SWE-bench Verified   76.4           Real-world software engineering
IFEval               92.6           Instruction following
MMLU                 88.5           General knowledge
MathVision           90.8           Mathematical visual reasoning
MMMU                 85.0           Multimodal understanding

The GPQA Diamond score of 88.4 is the highest of any open-source model. The SWE-bench Verified score of 76.4 shows competitive real-world coding ability — for reference, Claude Opus 4.6 scores above 80%.

On the hosted API side, Qwen 3.5-Plus (the proprietary variant) runs at ~$0.18 per million tokens, making it one of the cheapest frontier-tier options.
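At that price, rough budgeting is one line of arithmetic. A sketch using the ~$0.18/M rate quoted above (actual billing may split input and output rates):

```python
def api_cost_usd(tokens: float, usd_per_million: float = 0.18) -> float:
    """Cost at a flat per-million-token rate."""
    return tokens / 1e6 * usd_per_million

# 50M tokens a month at $0.18/M:
print(round(api_cost_usd(50e6), 2))  # 9.0
```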

The Competition: March 2026

Qwen 3.5 is too new for Chatbot Arena ELO ratings, but the open-source leaderboard tells a clear story about who's competing:

Rank   Model           Organization   ELO
1      GLM-5           Zhipu AI       1451
2      Kimi K2.5       Moonshot AI    1447
3      GLM-4.7         Zhipu AI       1445
4      Qwen 3 235B     Alibaba        1422
5      DeepSeek V3.2   DeepSeek       1421
6      Mistral Large   Mistral        1416
7      DeepSeek R1     DeepSeek       1398

Who's the Real Threat

GLM-5 / GLM-4.7 (Zhipu AI) currently sit at #1 and #3 by human preference. These are the models to beat. GLM-5 in particular has been remarkably consistent across diverse tasks.

Kimi K2.5 (Moonshot AI) is right on GLM-5's heels — a strong all-rounder that doesn't dominate any single benchmark but rarely fails either.

DeepSeek V3.2 / R1 — R1 dominates long-chain reasoning and math. V3.2 is the more practical general-purpose model. Together they cover a lot of ground.

Step-3.5-Flash (StepFun) deserves a mention: only 196B parameters, yet it scores 97.3 on AIME 2025, the highest math score on the board. Raw parameter count isn't everything.

The Pattern

The open-source LLM race is heavily dominated by Chinese labs — Alibaba, Zhipu, Moonshot, DeepSeek, StepFun. The main non-Chinese competitors are Mistral (France) and Google Gemma. Meta's Llama, once the default open-source choice, hasn't kept pace at the top of the leaderboard.

Practical Takeaway

Deploying Qwen 3.5 today comes down to memory. Match the model to your GPU:

  • Under 10 GB VRAM — Qwen3.5-4B dense (8.68 GB BF16) or Qwen3.5-2B for lighter workloads
  • 24 GB VRAM (RTX 4090) — Qwen3.5-9B dense (17.98 GB) is the sweet spot. Fast, capable, fits with room for context
  • 32 GB VRAM (RTX 5090) — Qwen3.5-9B dense with plenty of headroom for long context, or Qwen3.5-27B in FP8 (28.75 GB) if you want to push quality higher
  • 48 GB VRAM (A6000, dual consumer GPUs) — Qwen3.5-35B-A3B in FP8 (34.88 GB). MoE gives you 35B knowledge at 3B speed
  • Multi-GPU server — Qwen3.5-122B-A10B or the 397B flagship, depending on how many GPUs you can throw at it
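The guide above reduces to a lookup: take the largest model whose listed weight footprint fits your budget. A minimal sketch using the figures from the tables; the 10% headroom factor is an assumption, and real deployments need extra room for KV cache and context:

```python
# (weight footprint in GB, model) pairs from the tables above, smallest first
MODELS = [
    (4.24,  "Qwen3.5-2B (BF16)"),
    (8.68,  "Qwen3.5-4B (BF16)"),
    (17.98, "Qwen3.5-9B (BF16)"),
    (28.75, "Qwen3.5-27B (FP8)"),
    (34.88, "Qwen3.5-35B-A3B (FP8)"),
]

def pick_model(vram_gb: float, headroom: float = 0.9):
    """Largest model whose weights fit in `headroom` x available VRAM."""
    budget = vram_gb * headroom
    fitting = [name for gb, name in MODELS if gb <= budget]
    return fitting[-1] if fitting else None

print(pick_model(24))  # RTX 4090 -> Qwen3.5-9B (BF16)
```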

For most business deployments — product assistants, customer support, content generation — the 9B dense or 35B MoE models hit the practical sweet spot. The 397B flagship is impressive on benchmarks but requires serious infrastructure.

The broader trend: open-source models are closing the gap with proprietary ones fast. Qwen 3.5's benchmark numbers put it within striking distance of GPT-5.2 and Claude Opus 4.5, and it ships with Apache 2.0. For businesses that care about data privacy, cost control, and customization, that matters more than who's #1 on any given leaderboard.
