
Qwen 3.5: 35B Knowledge at 4B Speed — Better Than GPT-5?

ai.rs Mar 4, 2026

Alibaba released Qwen 3.5 between February 16 and March 2, 2026 — eight models spanning 0.8B to 397B parameters, all Apache 2.0 licensed. The flagship model claims to beat GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro across 80% of benchmark categories.

But benchmarks are benchmarks. What matters for deployment: how much VRAM do you actually need, and is the Mixture of Experts architecture worth the memory trade-off?


The Full Lineup

Qwen 3.5 ships in two flavors: dense models where every parameter fires on every token, and MoE (Mixture of Experts) models where a router selects a subset of parameters per token.

Dense Models

Model          Parameters   BF16 Memory   FP8 Memory
Qwen3.5-0.8B   873M         1.63 GB       —
Qwen3.5-2B     2.27B        4.24 GB       —
Qwen3.5-4B     4.66B        8.68 GB       —
Qwen3.5-9B     9.65B        17.98 GB      —
Qwen3.5-27B    27.78B       51.75 GB      28.75 GB

The small models (0.8B through 9B) are BF16-only — no FP8 variants published. The 27B model gets an FP8 option that nearly halves the memory footprint.
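These figures follow from simple arithmetic: weights take parameter count times bytes per parameter (2 for BF16), so a quick sanity check is easy to script. A minimal sketch; note the listed FP8 sizes run slightly above a flat 1 byte per parameter, presumably because some layers stay in higher precision:

```python
def model_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Raw weight memory in GB (1 GB = 2**30 bytes); excludes KV cache."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# Qwen3.5-27B in BF16: 27.78B params at 2 bytes each
print(round(model_memory_gb(27.78, 2), 2))  # close to the 51.75 GB listed
```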

MoE Models

The naming convention tells you everything: 35B-A3B means 35B total parameters, 3B active per token.

Model               Total Params   Active Params   BF16 Memory   FP8 Memory
Qwen3.5-35B-A3B     35.95B         ~3B             66.97 GB      34.88 GB
Qwen3.5-122B-A10B   125.09B        ~10B            232.99 GB     118.42 GB
Qwen3.5-397B-A17B   403.40B        ~17B            751.39 GB     378.23 GB

The 397B flagship needs 378 GB in FP8 — that's five A100-80GB GPUs at minimum. The 35B MoE model is the most practical: it fits in 35 GB (FP8) on a single high-end GPU while delivering inference speed comparable to a 4B dense model.
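The five-GPU figure is just a ceiling division of weight memory by per-card capacity; anything beyond that is headroom for KV cache and activations. A sketch, assuming 80 GB cards:

```python
import math

def min_gpus(weight_gb: float, gpu_gb: float = 80.0) -> int:
    """Minimum GPU count just to hold the weights (no KV-cache headroom)."""
    return math.ceil(weight_gb / gpu_gb)

print(min_gpus(378.23))  # 397B flagship in FP8 -> 5 x A100-80GB
```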

How Mixture of Experts Works

In a standard dense transformer, every parameter participates in every forward pass. A 27B dense model activates all 27B parameters for each token — that's the compute cost you pay.

MoE models split their feed-forward layers into multiple independent "expert" sub-networks. A lightweight router selects only a few experts per token. Most parameters stay idle during any given forward pass.

                    ┌──────────┐
          ┌────────>│ Expert 1 │─────────┐
          │         └──────────┘         │
Input ──> Router                     ──> Output
          │         ┌──────────┐         │
          └────────>│ Expert 3 │─────────┘
                    └──────────┘
            (Experts 2, 4...N idle)
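The diagram above can be sketched in a few lines of NumPy: a linear router scores all experts, only the top-k run, and their outputs are blended with softmax weights. Expert count, k, and the toy linear experts here are illustrative, not Qwen's actual configuration:

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Route one token through the top-k of N expert networks."""
    scores = x @ router_w                 # (num_experts,) router logits
    top = np.argsort(scores)[-k:]         # indices of the k best experts
    weights = np.exp(scores[top])
    weights /= weights.sum()              # softmax over the selected k only
    # Only the chosen experts run; the rest stay idle for this token.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, num_experts = 16, 8
x = rng.standard_normal(d)
router_w = rng.standard_normal((d, num_experts))
experts = [lambda v, W=rng.standard_normal((d, d)): v @ W
           for _ in range(num_experts)]

y = moe_forward(x, router_w, experts)
print(y.shape)  # (16,)
```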

The Trade-off in One Table

Model                   Active Compute   Knowledge Capacity   VRAM Needed
Qwen3.5-4B (dense)      4.66B            4.66B                8.68 GB
Qwen3.5-35B-A3B (MoE)   ~3B              35.95B               66.97 GB

Both activate roughly the same number of parameters per token (~3-4B), so inference speed is similar. But the MoE model carries 35B total parameters of learned knowledge versus only 4B — you get 4B-speed inference with 35B-quality answers.

The catch: all 35B parameters must sit in VRAM even though only 3B fire per token. MoE is essentially "I have the VRAM to spare, give me better answers without slowing down inference."

If you don't have the VRAM, a dense model that actually fits will beat a MoE model you can't load.

When to Use Which

Scenario                               Better Choice
Limited VRAM, need quality             Dense model that fits (e.g., 9B dense in 18 GB)
Enough VRAM, want best quality/speed   MoE (e.g., 35B-A3B: 3B compute, 35B knowledge)
Serving many concurrent users          MoE — high throughput at lower compute per request
Single-user, small batch               Dense model — simpler and equally fast

What's New in 3.5 vs Qwen 3

The architecture changes that matter:

  1. Expanded vocabulary — 250K tokens (up from 152K in Qwen 3). This means 10-60% fewer tokens for multilingual text, directly translating to lower inference cost and faster responses.

  2. Native multimodal training — Vision and language trained together from the start ("early fusion"), not bolted on later. Processes images up to 1344x1344 and video at 8 FPS.

  3. Hybrid attention with Delta Networks — Gated Delta Networks combined with sparse MoE for more efficient inference. The practical result: 8.6x faster decoding at 32K context, up to 19x at 256K context versus Qwen 3.

  4. 201 languages — Up from the already broad multilingual support in Qwen 3.

  5. Reinforcement learning at scale — Trained across "million-agent environments" with progressively complex tasks, specifically targeting agentic use cases (tool calling, multi-step workflows, code execution).

Benchmark Results

The 397B flagship hits strong numbers:

Benchmark            Qwen3.5-397B   What It Tests
GPQA Diamond         88.4           Graduate-level reasoning
AIME 2026            91.3           Olympiad mathematics
LiveCodeBench v6     83.6           Competitive programming
SWE-bench Verified   76.4           Real-world software engineering
IFEval               92.6           Instruction following
MMLU                 88.5           General knowledge
MathVision           90.8           Mathematical visual reasoning
MMMU                 85.0           Multimodal understanding

The GPQA Diamond score of 88.4 is the highest of any open-source model. The SWE-bench Verified score of 76.4 shows competitive real-world coding ability — for reference, Claude Opus 4.6 scores above 80%.

On the hosted API side, Qwen 3.5-Plus (the proprietary variant) runs at ~$0.18 per million tokens, making it one of the cheapest frontier-tier options.
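At that price, rough budgeting is one line of arithmetic. A sketch using the ~$0.18/M rate quoted above (actual billing may split input and output rates):

```python
def api_cost_usd(tokens: float, usd_per_million: float = 0.18) -> float:
    """Cost at a flat per-million-token rate."""
    return tokens / 1e6 * usd_per_million

# 50M tokens a month at $0.18/M:
print(round(api_cost_usd(50e6), 2))  # 9.0
```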

The Competition: March 2026

Qwen 3.5 is too new for Chatbot Arena ELO ratings, but the open-source leaderboard tells a clear story about who's competing:

Rank   Model           Organization   ELO
1      GLM-5           Zhipu AI       1451
2      Kimi K2.5       Moonshot AI    1447
3      GLM-4.7         Zhipu AI       1445
4      Qwen 3 235B     Alibaba        1422
5      DeepSeek V3.2   DeepSeek       1421
6      Mistral Large   Mistral        1416
7      DeepSeek R1     DeepSeek       1398

Who's the Real Threat

GLM-5 / GLM-4.7 (Zhipu AI) currently sit at #1 and #3 by human preference. These are the models to beat. GLM-5 in particular has been remarkably consistent across diverse tasks.

Kimi K2.5 (Moonshot AI) is right on GLM-5's heels — a strong all-rounder that doesn't dominate any single benchmark but rarely fails either.

DeepSeek V3.2 / R1 — R1 dominates long-chain reasoning and math. V3.2 is the more practical general-purpose model. Together they cover a lot of ground.

Step-3.5-Flash (StepFun) deserves a mention: only 196B parameters, yet it scores 97.3 on AIME 2025, the highest math score on the board. Raw parameter count isn't everything.

The Pattern

The open-source LLM race is heavily dominated by Chinese labs — Alibaba, Zhipu, Moonshot, DeepSeek, StepFun. The main non-Chinese competitors are Mistral (France) and Google Gemma. Meta's Llama, once the default open-source choice, hasn't kept pace at the top of the leaderboard.

Practical Takeaway

Deploying Qwen 3.5 today comes down to memory. Match the model to your GPU:

  • Under 10 GB VRAM — Qwen3.5-4B dense (8.68 GB BF16) or Qwen3.5-2B for lighter workloads
  • 24 GB VRAM (RTX 4090) — Qwen3.5-9B dense (17.98 GB) is the sweet spot. Fast, capable, fits with room for context
  • 32 GB VRAM (RTX 5090) — Qwen3.5-9B dense with plenty of headroom for long context, or Qwen3.5-27B in FP8 (28.75 GB) if you want to push quality higher
  • 48 GB VRAM (A6000, dual consumer GPUs) — Qwen3.5-35B-A3B in FP8 (34.88 GB). MoE gives you 35B knowledge at 3B speed
  • Multi-GPU server — Qwen3.5-122B-A10B or the 397B flagship, depending on how many GPUs you can throw at it
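The guide above reduces to a lookup: take the largest model whose listed weight footprint fits your budget. A minimal sketch using the figures from the tables; the 10% headroom factor is an assumption, and real deployments need extra room for KV cache and context:

```python
# (weight footprint in GB, model) pairs from the tables above, smallest first
MODELS = [
    (4.24,  "Qwen3.5-2B (BF16)"),
    (8.68,  "Qwen3.5-4B (BF16)"),
    (17.98, "Qwen3.5-9B (BF16)"),
    (28.75, "Qwen3.5-27B (FP8)"),
    (34.88, "Qwen3.5-35B-A3B (FP8)"),
]

def pick_model(vram_gb: float, headroom: float = 0.9):
    """Largest model whose weights fit in `headroom` x available VRAM."""
    budget = vram_gb * headroom
    fitting = [name for gb, name in MODELS if gb <= budget]
    return fitting[-1] if fitting else None

print(pick_model(24))  # RTX 4090 -> Qwen3.5-9B (BF16)
```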

For most business deployments — product assistants, customer support, content generation — the 9B dense or 35B MoE models hit the practical sweet spot. The 397B flagship is impressive on benchmarks but requires serious infrastructure.

The broader trend: open-source models are closing the gap with proprietary ones fast. Qwen 3.5's benchmark numbers put it within striking distance of GPT-5.2 and Claude Opus 4.5, and it ships with Apache 2.0. For businesses that care about data privacy, cost control, and customization, that matters more than who's #1 on any given leaderboard.
