What Is Mercury 2?
Mercury 2 is the first commercial reasoning diffusion LLM from Inception Labs. Unlike every major LLM you've used — GPT, Claude, Llama — Mercury 2 doesn't generate tokens one at a time. It uses diffusion to produce multiple tokens in parallel, then refines them over a small number of steps.
The result: ~1,000 tokens per second output throughput on NVIDIA Blackwell GPUs.
For context, Claude 4.5 Haiku outputs ~89 tok/s and GPT-5 Mini ~71 tok/s. Mercury 2 is roughly 10× faster.
How Diffusion LLMs Work
Traditional LLMs are autoregressive: they predict one token, append it, then predict the next. This is inherently sequential — each token depends on all previous tokens.
Diffusion LLMs take a fundamentally different approach borrowed from image generation (Stable Diffusion, DALL-E):
- Start with noise — begin with a block of random tokens
- Refine in parallel — iteratively denoise all tokens simultaneously
- Converge — after a small number of refinement steps, the output is coherent text
This is called block diffusion. Because tokens are generated in parallel rather than sequentially, GPU utilization skyrockets — you're doing useful compute across all cores simultaneously instead of waiting for one token at a time.
```
Autoregressive (traditional):
  Token 1 → Token 2 → Token 3 → Token 4 → ...
  [sequential, ~100 tok/s]

Diffusion (Mercury 2):
  [noise] → [rough draft] → [refined] → [final output]
  [parallel, ~1,000 tok/s]
```
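The refinement loop can be sketched as a toy simulation. This is a conceptual illustration only, not Inception Labs' actual algorithm: a block of masked "noise" tokens is filled in across all positions in parallel each step, so the whole block converges in far fewer steps than one-token-at-a-time decoding.

```python
import random

# Toy sketch of block diffusion (conceptual only -- NOT Inception Labs'
# actual algorithm). A block of MASK tokens is refined in parallel: each
# step, every still-masked position is independently revealed, so the
# block converges in a handful of rounds instead of one token per step.

TARGET = "the quick brown fox jumps over the lazy dog".split()
MASK = "<mask>"

def denoise_step(block, reveal_prob=0.5, rng=None):
    """One parallel refinement step: reveal each masked token independently."""
    rng = rng or random
    return [
        TARGET[i] if tok == MASK and rng.random() < reveal_prob else tok
        for i, tok in enumerate(block)
    ]

def diffusion_decode(n_tokens, rng):
    block = [MASK] * n_tokens   # start from pure "noise" (all masked)
    steps = 0
    while MASK in block:
        block = denoise_step(block, rng=rng)
        steps += 1
    return block, steps

rng = random.Random(0)
out, steps = diffusion_decode(len(TARGET), rng)
print(" ".join(out))
print(f"decoded {len(TARGET)} tokens in {steps} parallel steps "
      f"(autoregressive decoding would take {len(TARGET)} steps)")
```

Because roughly half the remaining masks resolve per round, the step count grows logarithmically with block length rather than linearly, which is the intuition behind the throughput gap.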
Benchmarks
Mercury 2 is positioned as a fast reasoning model: comparable to Claude 4.5 Haiku and GPT-5 Mini in quality, but dramatically faster:
| Benchmark | Mercury 2 | Claude 4.5 Haiku | GPT-5 Mini |
|---|---|---|---|
| AIME 2025 | 91.1 | ~90 | ~88 |
| GPQA | 73.6 | ~75 | ~72 |
| LiveCodeBench | 67.3 | ~65 | ~63 |
| IFBench | 71.3 | — | — |
| Output speed | ~1,000 tok/s | ~89 tok/s | ~71 tok/s |
This isn't competing with frontier models like Claude Opus or GPT-5 on the hardest reasoning tasks. It's targeting the fast agent tier — where speed matters more than peak intelligence.
Key Features
- 128K context window — handles large codebases and documents
- Tunable reasoning — adjust the quality/speed tradeoff per request
- Native tool use — function calling built in, not bolted on
- Schema-aligned JSON output — structured output without post-processing
- OpenAI API compatible — drop-in replacement, no code rewrites needed
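The schema-aligned JSON feature can be exercised through the OpenAI-compatible API. The payload below is an assumption: since Mercury 2 advertises OpenAI compatibility, it presumably accepts OpenAI's `response_format` / `json_schema` convention, but check Inception's docs for the exact field names they support.

```python
import json

# Hypothetical request payload for schema-aligned JSON output.
# ASSUMPTION: the API follows OpenAI's `response_format` convention
# with a `json_schema` type; Inception's docs are the authority here.

schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["summary", "confidence"],
}

payload = {
    "model": "mercury-2",
    "messages": [
        {"role": "user",
         "content": "Summarize: diffusion LLMs decode tokens in parallel."}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "summary", "schema": schema, "strict": True},
    },
}

print(json.dumps(payload, indent=2))
```

With schema enforcement on the server side, the response body should already parse and validate, eliminating the retry-and-repair loops that unstructured output usually requires.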
Where This Matters: Agentic Workflows
The real impact isn't chat. It's agentic loops where an LLM runs hundreds of iterations:
- Code generation pipelines — write, test, fix, repeat. At 1,000 tok/s, each iteration takes seconds instead of minutes
- Multi-step reasoning — chain-of-thought that would take 30 seconds now takes 3
- Real-time applications — live coding assistants, interactive debugging, instant analysis
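The write-test-fix loop looks something like the sketch below. The model call is stubbed out with a `fake_llm` function (a real agent would call Mercury 2 here); the point is the shape of the loop, where high throughput makes each retry cheap.

```python
# Sketch of a write-test-fix agent loop. `fake_llm` is a stand-in for
# a real model call (e.g. Mercury 2); it returns buggy code first, then
# a fix, to show the loop converging.

def fake_llm(prompt, attempt):
    """Stub for an LLM call: deliberately wrong on the first attempt."""
    if attempt == 0:
        return "def add(a, b):\n    return a - b"   # deliberate bug
    return "def add(a, b):\n    return a + b"

def run_tests(code):
    """Execute candidate code in an isolated namespace and check it."""
    ns = {}
    exec(code, ns)
    try:
        assert ns["add"](2, 3) == 5
        return True, None
    except AssertionError:
        return False, "add(2, 3) != 5"

def agent_loop(task, max_iters=5):
    feedback = ""
    for attempt in range(max_iters):
        code = fake_llm(task + feedback, attempt)
        ok, err = run_tests(code)
        if ok:
            return code, attempt + 1
        feedback = f"\nPrevious attempt failed: {err}. Fix it."
    raise RuntimeError("no passing solution within budget")

code, iters = agent_loop("Write add(a, b).")
print(f"passing code after {iters} iteration(s)")
```

At ~1,000 tok/s, the model call stops being the bottleneck in this loop; test execution and tooling overhead dominate instead.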
A developer on Hacker News proposed "intelligence per second" as the metric that matters: throughput × reasoning quality. Mercury 2 optimizes exactly this.
Hybrid Architecture Potential
The most interesting use case discussed in the community: frontier model for planning, diffusion model for execution.
Use Claude Opus or GPT-5 to create a high-level plan, then hand off to Mercury 2 for rapid iteration on individual steps. You get the best reasoning where it matters and maximum speed everywhere else.
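Structurally, the hybrid pattern is a planner/executor split. The sketch below stubs out both model calls (a real system would send `frontier_plan` to an Opus/GPT-5-class model and each step to Mercury 2); what matters is the shape: one expensive call, many cheap ones.

```python
# Sketch of the hybrid plan/execute pattern. Both model calls are
# stubbed; in a real system `frontier_plan` would hit a frontier model
# and `fast_execute` would hit Mercury 2.

def frontier_plan(task):
    """Stub for a frontier-model planner: returns an ordered step list."""
    return [f"step {i}: {part}" for i, part in enumerate(task.split(", "), 1)]

def fast_execute(step):
    """Stub for a fast executor: cheap, high-throughput per step."""
    return f"done ({step})"

def hybrid_run(task):
    plan = frontier_plan(task)              # one expensive, high-quality call
    return [fast_execute(s) for s in plan]  # many cheap, fast calls

results = hybrid_run("parse input, transform records, write output")
for r in results:
    print(r)
```

The economics follow from the call counts: if a task needs one plan and dozens of execution steps, almost all of the token volume lands on the fast, cheap model.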
Known Limitations
Mercury 2 is impressive but not without issues flagged by early users:
- Factual accuracy — parallel generation can produce hallucinations that don't self-correct through the sequence (autoregressive models at least have each token conditioned on all previous ones)
- Constraint satisfaction — struggles with tasks requiring strict sequential dependencies
- Not frontier-tier — if you need the absolute best reasoning, you still want Opus or GPT-5
How to Try It
Mercury 2 is available today via the Inception API. It's OpenAI API compatible, so you can point any existing client at it:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",
    api_key="your-inception-key",
)

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Explain quantum computing in 3 sentences"}],
)

print(response.choices[0].message.content)
```
What This Means for the Industry
Diffusion LLMs represent the first serious architectural challenge to the autoregressive paradigm that has dominated since GPT-2. If Mercury 2's approach scales to frontier quality, the entire cost structure of AI inference changes.
At 10× the throughput with comparable quality, inference costs drop dramatically. For businesses running AI at scale — customer support, content generation, code assistance — this could mean 10× more queries for the same GPU budget.
We're watching this space closely. The autoregressive vs. diffusion debate is just getting started.