
Mixture of Experts (MoE), Explained

ai.rs Jun 26, 2026


Almost every big open model shipping in 2026 — Kimi K2.6, Qwen, gpt-oss, Qwen-AgentWorld — is a Mixture of Experts (MoE). It's the reason a model can advertise a trillion parameters yet run at the speed of a small one. Here's the idea in plain terms, the trade-off it makes, and when it actually wins.

Dense vs Mixture of Experts

In a dense model, every parameter participates in every token. A 32B dense model does 32B parameters' worth of math for each token it reads or writes — full price, every time.

In an MoE model, the big feed-forward layers are split into many expert sub-networks. A small router looks at each token and picks just a few experts to run; the rest sit idle. So the model holds a huge number of total parameters (its knowledge), but only activates a small number — the active parameters — per token (its cost).
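The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not any particular model's implementation: the router is a single matrix of random weights, the "experts" are trivial scaling functions, and the names `route`, `moe_layer`, `router_weights`, and the sizes used are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(token, router_weights, k):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    # One score per expert: dot product of the token with that expert's router row.
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Renormalise over the chosen experts only, so the gate weights sum to 1.
    gates = softmax([scores[i] for i in top])
    return list(zip(top, gates))


def moe_layer(token, router_weights, experts, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(token)
    for idx, gate in route(token, router_weights, k):
        expert_out = experts[idx](token)  # the other experts are never called
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out


# Toy setup (assumed sizes): 8 experts, each just scales the token differently.
random.seed(0)
DIM, NUM_EXPERTS, TOP_K = 4, 8, 2
router_weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
chosen = route(token, router_weights, TOP_K)
output = moe_layer(token, router_weights, experts, TOP_K)
print("experts used:", [i for i, _ in chosen])
```

All eight experts exist in memory, but only two are ever called for this token. That gap between what is stored and what is run is the whole mechanism.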

The two numbers that matter: total vs active

Every MoE is described by two parameter counts:

Model	Total params	Active / token	Experts
Qwen3.5-35B-A3B	35B	3B	top-k routed
Kimi K2.6	1T	32B	8 of 384 (+1 shared)
Qwen-AgentWorld-397B-A17B	397B	17B	top-k routed

Read it as a rule of thumb: inference cost ≈ active params; answer quality ≈ total params. Kimi K2.6 costs about as much per token as a 32B dense model but answers with the knowledge of a 1T one.
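To make the two numbers concrete, here is a small calculation of the active fraction for each model, using only the figures from the table above. The helper name `active_fraction` is just for illustration.

```python
# (total, active) parameter counts in billions, copied from the table above.
models = {
    "Qwen3.5-35B-A3B": (35, 3),
    "Kimi K2.6": (1000, 32),
    "Qwen-AgentWorld-397B-A17B": (397, 17),
}


def active_fraction(total_b, active_b):
    """Share of the model's parameters that run for any one token."""
    return active_b / total_b


for name, (total_b, active_b) in models.items():
    pct = 100 * active_fraction(total_b, active_b)
    print(f"{name}: {pct:.1f}% of parameters active per token")
```

For these three models the active share works out to between roughly 3% and 9% of the total.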

Why MoE wins

Token generation is memory-bandwidth-bound: producing each token means streaming the active weights out of memory. Because an MoE reads only its active experts per token, it gets the speed of a small model while carrying the knowledge of a giant one. That is the whole pitch, and it's why the open frontier has gone almost entirely MoE.
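A back-of-the-envelope sketch of that argument: if reading weights is the only cost, the decode-speed ceiling is memory bandwidth divided by the bytes read per token. The bandwidth figure below is a made-up round number, and the estimate assumes a single request at 4-bit weights and ignores the KV cache, compute time, and any overhead, so treat the results as ceilings for comparison rather than predictions.

```python
def max_tokens_per_second(active_params_b, bytes_per_param, bandwidth_gb_s):
    """Upper bound on single-stream decode speed if weight reads are the only cost."""
    # Billions of parameters * bytes per parameter = gigabytes read per token.
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 1000.0  # hypothetical accelerator, not a measured figure
BYTES_PER_PARAM = 0.5    # 4-bit weights

moe_1t_32b_active = max_tokens_per_second(32, BYTES_PER_PARAM, BANDWIDTH_GB_S)
dense_1t = max_tokens_per_second(1000, BYTES_PER_PARAM, BANDWIDTH_GB_S)
print(f"1T MoE, 32B active: {moe_1t_32b_active:.1f} tok/s ceiling")
print(f"1T dense:           {dense_1t:.1f} tok/s ceiling")
```

Under these assumptions the MoE's ceiling is about 31 times higher than the dense model's, which is simply the ratio of total to active parameters (1000 / 32).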

The catch: memory

MoE buys you speed and quality, not a smaller memory footprint. Even though only a few experts fire per token, all of them must sit in VRAM, because the router might pick any of them for the next token. So a 1T-parameter MoE needs ~1T parameters' worth of memory (about 500 GB for the weights alone at 4-bit) while doing only 32B parameters' worth of work per token.
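The footprint arithmetic is simple enough to check directly. This sketch counts weights only; the KV cache, activations, and runtime overhead come on top, and the helper name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(total_params_b, bits_per_param):
    """GB needed just to hold the weights (no KV cache, activations, or overhead)."""
    # Billions of parameters * bits per parameter / 8 bits per byte = gigabytes.
    return total_params_b * bits_per_param / 8


for bits in (16, 8, 4):
    print(f"1T params at {bits}-bit: {weight_memory_gb(1000, bits):.0f} GB")

# For comparison: the 32B that actually run per token, stored at 4-bit.
print(f"32B params at 4-bit: {weight_memory_gb(32, 4):.0f} GB")
```

The last line is the point: the per-token work is what a 16 GB set of weights would need, but the memory bill is for all 500 GB.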

This is the defining trade-off: MoE trades memory capacity for inference speed. With the VRAM, you get frontier quality cheaply. Without it, a dense model that fits beats an MoE you can't load.

A few details

Shared expert — many MoEs (Kimi included) keep one expert that always runs, to capture common patterns, on top of the routed ones.
Top-k routing — the router scores all experts and picks the top k (e.g. 8 of 384). It's learned, not random.
Load balancing — training adds a loss term so the router spreads work across experts instead of overusing a few.
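As an illustration of what a load-balancing term can look like, here is a sketch of one widely used formulation, the auxiliary loss from the Switch Transformer paper (shown without its scaling coefficient and with top-1 dispatch for simplicity). Which exact loss the models named in this article use is not stated here, so read this as an example of the idea rather than their recipe. The function name and the toy probabilities are hypothetical.

```python
def load_balance_loss(router_probs, num_experts):
    """router_probs: one probability list per token, each summing to 1.

    Returns num_experts * sum_i(f_i * p_i), where f_i is the fraction of tokens
    whose top choice is expert i and p_i is the mean router probability for i.
    """
    num_tokens = len(router_probs)
    dispatch = [0.0] * num_experts   # f_i
    mean_prob = [0.0] * num_experts  # p_i
    for probs in router_probs:
        top = max(range(num_experts), key=lambda i: probs[i])
        dispatch[top] += 1 / num_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    return num_experts * sum(f * p for f, p in zip(dispatch, mean_prob))


# Four tokens, four experts.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4  # every token prefers expert 0

print("balanced: ", load_balance_loss(balanced, 4))
print("collapsed:", load_balance_loss(collapsed, 4))
```

The balanced case scores about 1.0, the minimum for this formulation, while the collapsed case scores about 2.8. Adding a term like this to the training loss penalises a router that keeps sending everything to the same few experts.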

When to choose MoE vs dense

Your situation	Better choice
Plenty of VRAM, want best quality-per-speed	MoE
VRAM-constrained (single consumer GPU)	a dense model that fits
Serving many users at once	MoE (high throughput per FLOP)
Simplicity / small single-GPU deploy	dense

Bottom line

MoE = many experts, a few fire per token. Total params = knowledge; active params = cost.
It delivers a giant model's quality at a small model's speed — why open frontier models are nearly all MoE now.
The price is memory: all experts live in VRAM regardless. MoE doesn't shrink the model; it shrinks the work.

