Mixture of Experts (MoE), Explained

ai.rs Jun 26, 2026

Almost every big open model shipping in 2026 — Kimi K2.6, Qwen, gpt-oss, Qwen-AgentWorld — is a Mixture of Experts (MoE). It's the reason a model can advertise a trillion parameters yet run at the speed of a small one. Here's the idea in plain terms, the trade-off it makes, and when it actually wins.

Dense vs Mixture of Experts

In a dense model, every parameter participates in every token. A 32B dense model does 32B parameters' worth of math for each token it reads or writes — full price, every time.

In an MoE model, the big feed-forward layers are split into many expert sub-networks. A small router looks at each token and picks just a few experts to run; the rest sit idle. So the model holds a huge number of total parameters (its knowledge), but only activates a small number — the active parameters — per token (its cost).
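To make the routing step concrete, here is a minimal sketch of one MoE layer handling a single token. It is a toy under stated assumptions, not how any of the models named here are implemented: the names `moe_forward`, `make_expert` and `router_weights` are hypothetical, the "experts" are trivial scaling functions rather than feed-forward networks, and real layers run on batched tensors.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Toy stand-in for an expert network: multiplies every feature by `scale`."""
    return lambda token: [scale * v for v in token]


def moe_forward(token, router_weights, experts, k):
    """Route one token through its top-k experts and mix their outputs."""
    # Router: one score per expert (dot product of the token with that expert's row).
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Keep the k highest-scoring experts; the others are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    # Turn the chosen scores into mixing weights that sum to 1.
    gates = softmax([scores[i] for i in chosen])
    out = [0.0] * len(token)
    for gate, i in zip(gates, chosen):
        expert_out = experts[i](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, chosen


random.seed(0)
dim, num_experts, k = 4, 8, 2
experts = [make_expert(i + 1) for i in range(num_experts)]
router_weights = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]
token = [0.5, -1.0, 0.25, 2.0]

out, chosen = moe_forward(token, router_weights, experts, k)
print("experts that ran:", chosen, "of", num_experts)
```

Only `k` of the eight expert functions are ever called for this token, which is the whole source of the compute saving; all eight still have to exist in memory for the router to choose among them.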

The two numbers that matter: total vs active

Every MoE is described by two parameter counts:

| Model | Total params | Active / token | Experts |
| --- | --- | --- | --- |
| Qwen3.5-35B-A3B | 35B | 3B | top-k routed |
| Kimi K2.6 | 1T | 32B | 8 of 384 (+1 shared) |
| Qwen-AgentWorld-397B-A17B | 397B | 17B | top-k routed |

Read it as a rule of thumb: per-token inference cost tracks active params; answer quality tracks total params. Kimi K2.6 costs about as much compute per token as a 32B dense model but draws on the knowledge of a 1T one.
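A quick way to compare these models is the share of weights each one actually uses per token. The sketch below just divides the two columns from the table above; it assumes "1T" means roughly 1000B, and the helper name `active_fraction` is made up for illustration.

```python
# (total, active) parameter counts in billions, copied from the table above.
# Assumption: "1T" is treated as 1000B.
models = {
    "Qwen3.5-35B-A3B": (35, 3),
    "Kimi K2.6": (1000, 32),
    "Qwen-AgentWorld-397B-A17B": (397, 17),
}


def active_fraction(total_b, active_b):
    """Share of the model's parameters that participate in a single token."""
    return active_b / total_b


for name, (total_b, active_b) in models.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of weights used per token")
```

All three land in the single digits, roughly 9%, 3% and 4% respectively, which is why the per-token cost looks like that of a much smaller model.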

Why MoE wins

Token generation is memory-bandwidth-bound: every new token has to stream the active weights out of memory. Because an MoE reads only its active experts per token, it gets the speed of a small model while carrying the knowledge of a giant one. That is the whole pitch, and it's why the open frontier has gone almost entirely MoE.
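The bandwidth argument can be turned into a back-of-the-envelope ceiling: if each token must read every active weight once, tokens per second cannot exceed memory bandwidth divided by the bytes of active weights. The sketch below assumes a single request at a time, 4-bit weights, and a hypothetical 1000 GB/s of memory bandwidth; it ignores the KV cache and all compute time, so real numbers will be lower.

```python
def decode_ceiling_tokens_per_s(active_params_b, bits_per_weight, bandwidth_gb_s):
    """Upper bound on single-stream tokens/s when each token reads all active weights once."""
    # Billions of parameters times bytes per parameter gives gigabytes read per token.
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 1000  # hypothetical accelerator, chosen only to make the ratio easy to see

moe_ceiling = decode_ceiling_tokens_per_s(32, 4, BANDWIDTH_GB_S)      # 1T MoE, 32B active
dense_ceiling = decode_ceiling_tokens_per_s(1000, 4, BANDWIDTH_GB_S)  # hypothetical 1T dense

print(f"1T MoE with 32B active: at most {moe_ceiling:.1f} tokens/s")
print(f"1T dense:               at most {dense_ceiling:.1f} tokens/s")
```

The absolute figures depend entirely on the assumed hardware; the point is the ratio, which equals total params divided by active params.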

The catch: memory

MoE buys you speed and quality, not a smaller memory footprint. Even though only a few experts fire per token, all of them must sit in VRAM, because the router might pick any of them for the next token. So a 1T-parameter MoE needs ~1T parameters' worth of memory (about 500 GB for the weights alone at 4-bit) while doing only 32B parameters' worth of work per token.
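The footprint is simple arithmetic on the total parameter count, not the active one. This sketch counts weights only; the KV cache, activations and quantization metadata add more on top, and the function name `weight_memory_gb` is just an illustrative choice.

```python
def weight_memory_gb(total_params_b, bits_per_weight):
    """Memory needed to hold every weight, in GB (billions of params times bytes each)."""
    return total_params_b * bits_per_weight / 8


for label, total_b in [("35B MoE", 35), ("397B MoE", 397), ("1T MoE", 1000)]:
    at_4bit = weight_memory_gb(total_b, 4)
    at_16bit = weight_memory_gb(total_b, 16)
    print(f"{label}: {at_4bit:g} GB at 4-bit, {at_16bit:g} GB at 16-bit")
```

Note that the active parameter count never appears in this calculation: a 1T MoE with 32B active needs the same weight memory as a 1T dense model.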

This is the defining trade-off: MoE trades memory capacity for inference speed. With the VRAM, you get frontier quality cheaply. Without it, a dense model that fits beats an MoE you can't load.

A few details

  • Shared expert — many MoEs (Kimi included) keep one expert that always runs, to capture common patterns, on top of the routed ones.
  • Top-k routing — the router scores all experts and picks the top k (e.g. 8 of 384). It's learned, not random.
  • Load balancing — training adds a loss term so the router spreads work across experts instead of overusing a few.
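As an illustration of the load-balancing idea, here is one common formulation of the auxiliary loss, the one introduced with the Switch Transformer: multiply, per expert, the fraction of tokens routed to it by the average router probability it received, sum over experts, and scale by the number of experts. This is a sketch in plain Python; the article does not say which loss the models above use, and in real training the term is computed on tensors and multiplied by a small coefficient.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-Transformer-style auxiliary loss.

    router_probs: one list per token with the router's probability for each expert.
    assignments:  one list per token with the expert indices that token was sent to.
    """
    total_assignments = sum(len(chosen) for chosen in assignments)
    # f[i]: fraction of all token-to-expert assignments that went to expert i.
    f = [0.0] * num_experts
    for chosen in assignments:
        for i in chosen:
            f[i] += 1 / total_assignments
    # p[i]: router probability for expert i, averaged over tokens.
    num_tokens = len(router_probs)
    p = [sum(tok[i] for tok in router_probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Four tokens spread evenly over four experts.
balanced = load_balance_loss([[0.25] * 4] * 4, [[0], [1], [2], [3]], 4)
# Four tokens all sent to expert 0 with full confidence.
collapsed = load_balance_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [[0]] * 4, 4)
print("balanced:", balanced, "collapsed:", collapsed)
```

Even routing gives the minimum value of 1, while sending everything to one expert pushes the loss up to the number of experts, so minimizing it nudges the router toward spreading tokens out.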

When to choose MoE vs dense

| Your situation | Better choice |
| --- | --- |
| Plenty of VRAM, want best quality-per-speed | MoE |
| VRAM-constrained (single consumer GPU) | a dense model that fits |
| Serving many users at once | MoE (high throughput per FLOP) |
| Simplicity / small single-GPU deploy | dense |

Bottom line

  • MoE = many experts, a few fire per token. Total params = knowledge; active params = cost.
  • It delivers a giant model's quality at a small model's speed — why open frontier models are nearly all MoE now.
  • The price is memory: all experts live in VRAM regardless. MoE doesn't shrink the model; it shrinks the work.
