Moonshot AI shipped Kimi K2.6 on April 20, 2026 — open weights, a permissive Modified MIT license, and benchmark numbers that put an open model level with the closed frontier on coding. That combination is why it is one of the most-searched model names right now.
Under the hood it is a 1-trillion-parameter Mixture-of-Experts that activates only 32 billion parameters per token, with native multimodality (a MoonViT vision encoder) and a 256K-token context window. Here is the architecture, the very real memory bill, how to actually run it, and where it lands on benchmarks.
A trillion parameters, 32 billion at a time
K2.6's headline trick is Mixture of Experts (MoE). Instead of one dense network where every parameter fires on every token, the feed-forward layers are split into 384 expert sub-networks. A lightweight router picks just 8 experts per token (plus 1 shared expert that always runs), so only about 32B of the 1T parameters do any work on a given token.
| Spec | Kimi K2.6 |
|---|---|
| Total parameters | 1T |
| Active per token | 32B |
| Experts | 384 (8 routed + 1 shared) |
| Layers | 61 (1 dense) |
| Attention | MLA (Multi-head Latent Attention), 64 heads, 7,168 hidden |
| Expert hidden dim | 2,048 |
| Vocabulary | 160K |
| Context window | 256K tokens |
| Vision | MoonViT (400M) |
| Activation | SwiGLU |
The payoff: you get the knowledge capacity of a 1T model at the inference cost of a ~32B one. The catch — and it is a big one — is that all 1T parameters still have to sit in memory even though only 32B fire per token. MoE buys you speed and quality, not a smaller memory footprint.
One detail matters for that 256K context: K2.6 uses MLA (Multi-head Latent Attention), which compresses the KV cache far below what vanilla multi-head attention needs. (If "KV cache" does not mean anything yet, see our explainer: The KV Cache.) Without MLA, a 256K context on a model this size would be unservable.
The memory bill
This is where a 1T model gets real. The weights alone:
| Precision | Weights | GPUs (80 GB) |
|---|---|---|
| FP16 / BF16 | ~2 TB | 8x H100 |
| Native INT4 (QAT) | ~594 GB | 4x H100 |
The good news is that K2.6 ships a native INT4 checkpoint — quantization-aware trained (the same approach as Kimi-K2-Thinking), not a lossy afterthought. That makes 4x H100 (80 GB) the realistic floor for self-hosting at full quality, versus 8x for FP16. On top of the weights you pay for the KV cache, but MLA keeps that modest even at long context.
Translation for most teams: you are probably not going to self-host this. A 4x H100 box is a serious commitment for a single model. That is what the hosted APIs are for.
How to actually run it
Use an API (recommended for almost everyone). K2.6 is available through Moonshot's own platform and a growing list of third parties — GMI Cloud, DeepInfra, OpenRouter, Lambda — all behind OpenAI-compatible endpoints, so switching is a base-URL-and-key change. Indicative pricing is about $0.68 per million input tokens and $3.41 per million output, with measured ~0.53s time-to-first-token and ~77 tokens/sec on DeepInfra. That is a fraction of frontier closed-model pricing for comparable coding ability.
Self-host (open weights on Hugging Face, Modified MIT). If you need data residency, very high volume, or fine-tuning, the supported engines are vLLM, SGLang (>= 0.5.10), and KTransformers — again all OpenAI-compatible. The native INT4 build runs on 4x H100; FP16 wants 8x. KTransformers is worth a look if you are GPU-constrained: it offloads experts to system RAM and runs the hot path on GPU, trading speed for a smaller GPU bill.
Rule of thumb: API unless privacy, volume, or customization forces self-hosting.
Benchmarks: built for agents and code
Moonshot positions K2.6 as a coding and agentic model first, and the numbers back the framing (these are vendor-reported; independent reproductions will vary):
| Benchmark | K2.6 | Measures |
|---|---|---|
| SWE-Bench Pro | 58.6 | real-world software engineering |
| SWE-bench Multilingual | 76.7 | coding across languages |
| HLE (with tools) | 54.0 | hard reasoning with tool use |
| BrowseComp | 83.2 | autonomous web research |
| Toolathlon | 50.0 | multi-tool orchestration |
| CharXiv (with Python) | 86.7 | chart / figure understanding |
| MathVision (with Python) | 93.2 | visual mathematics |
| AIME 2026 | 96.4 | competition math |
The eye-catcher is SWE-Bench Pro, where an open model noses ahead of the closed frontier:
| Model | SWE-Bench Pro |
|---|---|
| Kimi K2.6 | 58.6 |
| GPT-5.4 (xhigh) | 57.7 |
| Gemini 3.1 Pro | 54.2 |
| Claude Opus 4.6 | 53.4 |
Where it gives ground is pure math reasoning — on AIME 2026 its 96.4 trails GPT-5.4 (99.2) and Gemini 3.1 Pro (98.3) and roughly matches Claude Opus 4.6 (96.7). So if your workload is olympiad-style math, the very top closed models still edge it; if it is shipping code and running tools, K2.6 is at or above them.
The other headline is Agent Swarm: K2.6 can fan a task out to as many as 300 domain-specialized sub-agents and run up to 4,000 coordinated steps in a single autonomous pass. It is the feature behind its agentic-search jump — BrowseComp climbs from K2.5's ~78 to 86+ in swarm mode.
The bottom line
- The story: frontier-class coding and agentic capability with open weights and a permissive license — and on SWE-Bench Pro it edges GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6.
- Architecture: 1T-parameter MoE, 32B active (8 of 384 experts + 1 shared), MLA attention, 256K context, native multimodal.
- Memory: the trade-off MoE does not solve — ~594 GB at native INT4 (4x H100), ~2 TB at FP16 (8x). Most teams should hit an API, not a GPU rack.
- Pick it when: you want open-weight coding/agent performance at the frontier, or you must self-host for privacy/volume/fine-tuning. Look elsewhere when: your bottleneck is pure math reasoning, where the closed leaders still win.
Weights and the model card are on Hugging Face.