AI News 9 min read

Kimi K2.6 Explained: a Trillion-Parameter Open Model

ai.rs Jun 14, 2026

kimi moonshot moe open-source benchmarks self-hosting

Moonshot AI shipped Kimi K2.6 on April 20, 2026 — open weights, a permissive Modified MIT license, and benchmark numbers that put an open model level with the closed frontier on coding. That combination is why it is one of the most-searched model names right now.

Under the hood it is a 1-trillion-parameter Mixture-of-Experts that activates only 32 billion parameters per token, with native multimodality (a MoonViT vision encoder) and a 256K-token context window. Here is the architecture, the very real memory bill, how to actually run it, and where it lands on benchmarks.

A trillion parameters, 32 billion at a time

K2.6's headline trick is Mixture of Experts (MoE). Instead of one dense network where every parameter fires on every token, the feed-forward layers are split into 384 expert sub-networks. A lightweight router picks just 8 experts per token (plus 1 shared expert that always runs), so only about 32B of the 1T parameters do any work on a given token.

Spec	Kimi K2.6
Total parameters	1T
Active per token	32B
Experts	384 (8 routed + 1 shared)
Layers	61 (1 dense)
Attention	MLA (Multi-head Latent Attention), 64 heads, 7,168 hidden
Expert hidden dim	2,048
Vocabulary	160K
Context window	256K tokens
Vision	MoonViT (400M)
Activation	SwiGLU

The payoff: you get the knowledge capacity of a 1T model at the inference cost of a ~32B one. The catch — and it is a big one — is that all 1T parameters still have to sit in memory even though only 32B fire per token. MoE buys you speed and quality, not a smaller memory footprint.

One detail matters for that 256K context: K2.6 uses MLA (Multi-head Latent Attention), which compresses the KV cache far below what vanilla multi-head attention needs. (If "KV cache" does not mean anything yet, see our explainer: The KV Cache.) Without MLA, a 256K context on a model this size would be unservable.

The memory bill

This is where a 1T model gets real. The weights alone:

Precision	Weights	GPUs (80 GB)
FP16 / BF16	~2 TB	8x H100
Native INT4 (QAT)	~594 GB	4x H100

The good news is that K2.6 ships a native INT4 checkpoint — quantization-aware trained (the same approach as Kimi-K2-Thinking), not a lossy afterthought. That makes 4x H100 (80 GB) the realistic floor for self-hosting at full quality, versus 8x for FP16. On top of the weights you pay for the KV cache, but MLA keeps that modest even at long context.

Translation for most teams: you are probably not going to self-host this. A 4x H100 box is a serious commitment for a single model. That is what the hosted APIs are for.

How to actually run it

Use an API (recommended for almost everyone). K2.6 is available through Moonshot's own platform and a growing list of third parties — GMI Cloud, DeepInfra, OpenRouter, Lambda — all behind OpenAI-compatible endpoints, so switching is a base-URL-and-key change. Indicative pricing is about $0.68 per million input tokens and $3.41 per million output, with measured ~0.53s time-to-first-token and ~77 tokens/sec on DeepInfra. That is a fraction of frontier closed-model pricing for comparable coding ability.

Self-host (open weights on Hugging Face, Modified MIT). If you need data residency, very high volume, or fine-tuning, the supported engines are vLLM, SGLang (>= 0.5.10), and KTransformers — again all OpenAI-compatible. The native INT4 build runs on 4x H100; FP16 wants 8x. KTransformers is worth a look if you are GPU-constrained: it offloads experts to system RAM and runs the hot path on GPU, trading speed for a smaller GPU bill.

Rule of thumb: API unless privacy, volume, or customization forces self-hosting.

Benchmarks: built for agents and code

Moonshot positions K2.6 as a coding and agentic model first, and the numbers back the framing (these are vendor-reported; independent reproductions will vary):

Benchmark	K2.6	Measures
SWE-Bench Pro	58.6	real-world software engineering
SWE-bench Multilingual	76.7	coding across languages
HLE (with tools)	54.0	hard reasoning with tool use
BrowseComp	83.2	autonomous web research
Toolathlon	50.0	multi-tool orchestration
CharXiv (with Python)	86.7	chart / figure understanding
MathVision (with Python)	93.2	visual mathematics
AIME 2026	96.4	competition math

The eye-catcher is SWE-Bench Pro, where an open model noses ahead of the closed frontier:

Model	SWE-Bench Pro
Kimi K2.6	58.6
GPT-5.4 (xhigh)	57.7
Gemini 3.1 Pro	54.2
Claude Opus 4.6	53.4

Where it gives ground is pure math reasoning — on AIME 2026 its 96.4 trails GPT-5.4 (99.2) and Gemini 3.1 Pro (98.3) and roughly matches Claude Opus 4.6 (96.7). So if your workload is olympiad-style math, the very top closed models still edge it; if it is shipping code and running tools, K2.6 is at or above them.

The other headline is Agent Swarm: K2.6 can fan a task out to as many as 300 domain-specialized sub-agents and run up to 4,000 coordinated steps in a single autonomous pass. It is the feature behind its agentic-search jump — BrowseComp climbs from K2.5's ~78 to 86+ in swarm mode.

The bottom line

The story: frontier-class coding and agentic capability with open weights and a permissive license — and on SWE-Bench Pro it edges GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6.
Architecture: 1T-parameter MoE, 32B active (8 of 384 experts + 1 shared), MLA attention, 256K context, native multimodal.
Memory: the trade-off MoE does not solve — ~594 GB at native INT4 (4x H100), ~2 TB at FP16 (8x). Most teams should hit an API, not a GPU rack.
Pick it when: you want open-weight coding/agent performance at the frontier, or you must self-host for privacy/volume/fine-tuning. Look elsewhere when: your bottleneck is pure math reasoning, where the closed leaders still win.

Weights and the model card are on Hugging Face.

What does this mean for your business?

New models drop every month. The real question is whether the underlying capability fits your business. Find out in 2 minutes.

Take the AI Readiness Check

Share: Post Share

Kimi K2.6 Explained: a Trillion-Parameter Open Model

A trillion parameters, 32 billion at a time

The memory bill

How to actually run it

Benchmarks: built for agents and code

The bottom line

What does this mean for your business?

Read next

Qwen 3.5: 35B Knowledge at 4B Speed — Better Than GPT-5?

Qwen-AgentWorld: the Open Language World Model for AI Agents

Gemma 4 LoRA Fine-Tuning on RTX 5090: What Works and What Doesn't

On This Page

AI News

A trillion parameters, 32 billion at a time

The memory bill

How to actually run it

Benchmarks: built for agents and code

The bottom line

Related reading

What does this mean for your business?

Read next

Qwen 3.5: 35B Knowledge at 4B Speed — Better Than GPT-5?

Qwen-AgentWorld: the Open Language World Model for AI Agents

Gemma 4 LoRA Fine-Tuning on RTX 5090: What Works and What Doesn't

On This Page

AI News