Almost every big open model shipping in 2026 — Kimi K2.6, Qwen, gpt-oss, Qwen-AgentWorld — is a Mixture of Experts (MoE). It's the reason a model can advertise a trillion parameters yet run at the speed of a small one. Here's the idea in plain terms, the trade-off it makes, and when it actually wins.
Dense vs Mixture of Experts
In a dense model, every parameter participates in every token. A 32B dense model does 32B parameters' worth of math for each token it reads or writes — full price, every time.
In an MoE model, the big feed-forward layers are split into many expert sub-networks. A small router looks at each token and picks just a few experts to run; the rest sit idle. So the model holds a huge number of total parameters (its knowledge), but only activates a small number — the active parameters — per token (its cost).
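The routing step is easier to see in code. Below is a minimal sketch of one MoE layer handling one token, in plain Python. The names (`route`, `moe_layer`, `router_weights`) and the toy experts are made up for illustration, and renormalizing the gate weights over only the selected experts is one common convention, not necessarily what any model named here does.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def router_logits(token, router_weights):
    """One score per expert: dot product of the token with that expert's router row."""
    return [sum(w * x for w, x in zip(row, token)) for row in router_weights]

def route(token, router_weights, k):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    logits = router_logits(token, router_weights)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])  # assumption: normalize over the chosen k only
    return list(zip(top, gates))

def moe_layer(token, experts, router_weights, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(token)
    for idx, gate in route(token, router_weights, k):
        expert_out = experts[idx](token)  # the other experts are never called
        out = [o + gate * y for o, y in zip(out, expert_out)]
    return out

# Toy setup (hypothetical): 4 "experts" that just scale the input by 1x..4x.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.1, 0.0], [2.0, 0.0], [-1.0, 0.0], [1.5, 0.0]]
token = [1.0, 0.5]

chosen = route(token, router_weights, k=2)   # experts 1 and 3 score highest
output = moe_layer(token, experts, router_weights, k=2)
print(chosen)
print(output)
```

All four experts exist in memory, but only two are ever called for this token. That gap between what is stored and what is run is the whole MoE idea.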
The two numbers that matter: total vs active
Every MoE is described by two parameter counts:
| Model | Total params | Active / token | Experts |
|---|---|---|---|
| Qwen3.5-35B-A3B | 35B | 3B | top-k routed |
| Kimi K2.6 | 1T | 32B | 8 of 384 (+1 shared) |
| Qwen-AgentWorld-397B-A17B | 397B | 17B | top-k routed |
Read it as a rule of thumb: per-token inference cost tracks active params, while answer quality tracks total params. Kimi K2.6 costs about as much per token as a 32B dense model but answers with the knowledge of a 1T one.
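A quick way to compare MoEs is the share of weights that actually run per token. This small sketch uses only the total and active counts from the table above; the helper name `active_fraction` is just for illustration.

```python
# (total params, active params per token), taken from the table above
models = {
    "Qwen3.5-35B-A3B": (35e9, 3e9),
    "Kimi K2.6": (1e12, 32e9),
    "Qwen-AgentWorld-397B-A17B": (397e9, 17e9),
}

def active_fraction(total_params, active_params):
    """Share of the model's weights that participate in a single token."""
    return active_params / total_params

for name, (total, active) in models.items():
    print(f"{name}: {active_fraction(total, active):.1%} of weights active per token")
# Qwen3.5-35B-A3B: 8.6% of weights active per token
# Kimi K2.6: 3.2% of weights active per token
# Qwen-AgentWorld-397B-A17B: 4.3% of weights active per token
```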
Why MoE wins
Token generation is memory-bandwidth-bound: every token streams the active weights out of memory. Because an MoE reads only its active experts per token, it gets the speed of a small model while carrying the knowledge of a giant one. That is the whole pitch, and it's why the open frontier has gone almost entirely MoE.
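To put rough numbers on the bandwidth argument, here is a back-of-the-envelope ceiling on single-stream decode speed: memory bandwidth divided by the bytes of weights read per token. The 1000 GB/s bandwidth figure is a hypothetical round number, not a specific GPU. The estimate also ignores the KV cache, activations, compute time, and batching, so treat it as an upper bound rather than a prediction.

```python
def max_tokens_per_second(active_params, bits_per_param, bandwidth_gb_s):
    """Upper bound on decode speed if reading active weights were the only cost."""
    bytes_per_token = active_params * bits_per_param / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

BANDWIDTH_GB_S = 1000  # hypothetical accelerator, chosen for easy arithmetic

# A 1T-total MoE with 32B active, at 4-bit weights
moe_ceiling = max_tokens_per_second(32e9, 4, BANDWIDTH_GB_S)
# An imaginary 1T dense model that reads every weight for every token
dense_ceiling = max_tokens_per_second(1e12, 4, BANDWIDTH_GB_S)

print(f"MoE (32B active): {moe_ceiling:.1f} tokens/s ceiling")   # 62.5
print(f"Dense 1T:         {dense_ceiling:.1f} tokens/s ceiling")  # 2.0
```

Both models hold the same number of parameters, but the MoE's ceiling is about 31 times higher, which is the ratio of total to active parameters.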
The catch: memory
MoE buys you speed and quality, not a smaller memory footprint. Even though only a few experts fire per token, all of them must sit in VRAM, because the router might pick any of them for the next token. So a 1T-parameter MoE needs ~1T parameters' worth of memory (roughly 500 GB for the weights alone, even at 4-bit) while doing only 32B of work per token.
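The memory side is simple arithmetic: total parameters times bits per parameter. This sketch counts weights only. KV cache, activations, and runtime overhead come on top, and the 24 GB card in the example is a hypothetical consumer GPU.

```python
def weight_memory_gb(total_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9

def weights_fit(total_params, bits_per_param, vram_gb):
    """True if the weights alone fit. A real deployment needs extra headroom."""
    return weight_memory_gb(total_params, bits_per_param) <= vram_gb

# A 1T-parameter MoE at different precisions
for bits in (16, 8, 4):
    print(f"1T params at {bits}-bit: {weight_memory_gb(1e12, bits):.0f} GB")
# 1T params at 16-bit: 2000 GB
# 1T params at 8-bit: 1000 GB
# 1T params at 4-bit: 500 GB

# Hypothetical 24 GB consumer GPU
print(weights_fit(35e9, 4, 24))   # 35B total at 4-bit is 17.5 GB of weights
print(weights_fit(1e12, 4, 24))   # 1T total at 4-bit is 500 GB of weights
```

The calculation uses total parameters, never active ones. The active count sets the speed; the total count sets whether the model loads at all.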
This is the defining trade-off: MoE trades memory capacity for inference speed. If you have the VRAM, you get frontier quality cheaply. If you don't, a dense model that fits beats an MoE you can't load.
A few details
- Shared expert — many MoEs (Kimi included) keep one expert that always runs, to capture common patterns, on top of the routed ones.
- Top-k routing — the router scores all experts and picks the top k (e.g. 8 of 384). It's learned, not random.
- Load balancing — training adds a loss term so the router spreads work across experts instead of overusing a few.
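To make the load-balancing bullet concrete, here is a sketch of one well-known auxiliary loss, the formulation from the Switch Transformer paper: the number of experts times the sum, over experts, of (fraction of routed slots the expert received) times (mean router probability for that expert). This is an assumption for illustration. The article doesn't say which loss any of these models actually train with, and extending the original top-1 formula to top-k by counting slots is also a choice made here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def load_balance_loss(batch_logits, k):
    """Switch-style auxiliary loss: n_experts * sum_i(f_i * p_i).

    f_i = share of routed slots that went to expert i
    p_i = mean router probability assigned to expert i
    The value is 1.0 when routing is perfectly even and grows as experts are overused.
    """
    n_tokens = len(batch_logits)
    n_experts = len(batch_logits[0])
    counts = [0] * n_experts
    prob_sums = [0.0] * n_experts
    for logits in batch_logits:
        probs = softmax(logits)
        top = sorted(range(n_experts), key=lambda i: logits[i], reverse=True)[:k]
        for i in top:
            counts[i] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    f = [c / (n_tokens * k) for c in counts]
    p = [s / n_tokens for s in prob_sums]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Four tokens, four experts, each token preferring a different expert
balanced = [[5, 0, 0, 0], [0, 5, 0, 0], [0, 0, 5, 0], [0, 0, 0, 5]]
# Four tokens that all pile onto expert 0
collapsed = [[5, 0, 0, 0]] * 4

print(load_balance_loss(balanced, k=1))
print(load_balance_loss(collapsed, k=1))
```

The collapsed batch scores well above the balanced one, so adding this term to the training loss pushes the router away from leaning on a few favorite experts.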
When to choose MoE vs dense
| Your situation | Better choice |
|---|---|
| Plenty of VRAM, want best quality-per-speed | MoE |
| VRAM-constrained (single consumer GPU) | a dense model that fits |
| Serving many users at once | MoE (fewer FLOPs per token, so more tokens per second from the same compute) |
| Simplicity / small single-GPU deploy | dense |
Bottom line
- MoE = many experts, a few fire per token. Total params = knowledge; active params = cost.
- It delivers a giant model's quality at a small model's speed — why open frontier models are nearly all MoE now.
- The price is memory: all experts live in VRAM regardless. MoE doesn't shrink the model; it shrinks the work.