
Mercury 2: The First Reasoning Diffusion LLM — 1,000 Tokens/sec

ai.rs Feb 26, 2026

What Is Mercury 2?

Mercury 2 is the first commercial reasoning diffusion LLM from Inception Labs. Unlike every major LLM you've used — GPT, Claude, Llama — Mercury 2 doesn't generate tokens one at a time. It uses diffusion to produce multiple tokens in parallel, then refines them over a small number of steps.

The result: ~1,000 tokens per second output throughput on NVIDIA Blackwell GPUs.

For context, Claude 4.5 Haiku outputs ~89 tok/s and GPT-5 Mini ~71 tok/s. Mercury 2 is roughly 10× faster.

How Diffusion LLMs Work

Traditional LLMs are autoregressive: they predict one token, append it, then predict the next. This is inherently sequential — each token depends on all previous tokens.

Diffusion LLMs take a fundamentally different approach borrowed from image generation (Stable Diffusion, DALL-E):

  1. Start with noise — begin with a block of random tokens
  2. Refine in parallel — iteratively denoise all tokens simultaneously
  3. Converge — after a small number of refinement steps, the output is coherent text

This is called block diffusion. Because tokens are generated in parallel rather than sequentially, GPU utilization skyrockets — you're doing useful compute across all cores simultaneously instead of waiting for one token at a time.

Autoregressive (traditional):
  Token 1 → Token 2 → Token 3 → Token 4 → ...
  [sequential, ~100 tok/s]

Diffusion (Mercury 2):
  [noise] → [rough draft] → [refined] → [final output]
  [parallel, ~1,000 tok/s]
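The refinement loop above can be sketched with a toy simulation. Nothing here is the actual Mercury 2 denoiser; a coin flip stands in for the model's per-token confidence at each step, but it shows why the number of parallel refinement steps stays far smaller than the number of tokens:

```python
import random

random.seed(0)  # deterministic for the demo

MASK = "_"  # placeholder for a not-yet-denoised token
target = "the quick brown fox jumps over the lazy dog".split()

def denoise_step(draft):
    """One refinement step: every masked position resolves in parallel
    with probability 0.5. A real diffusion LLM would instead commit the
    positions its denoiser is most confident about."""
    return [real if tok == MASK and random.random() < 0.5 else tok
            for tok, real in zip(draft, target)]

draft = [MASK] * len(target)  # start from pure "noise"
steps = 0
while MASK in draft:
    draft = denoise_step(draft)
    steps += 1

print(f"{len(target)} tokens converged in {steps} parallel steps")
```

Because each step touches every position at once, the expected step count grows roughly logarithmically with sequence length rather than linearly, which is where the throughput advantage comes from.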

Benchmarks

Mercury 2 is positioned as a fast reasoning model, comparable in quality to Claude 4.5 Haiku and GPT-5 Mini but dramatically faster:

Benchmark        Mercury 2       Claude 4.5 Haiku   GPT-5 Mini
AIME 2025        91.1            ~90                ~88
GPQA             73.6            ~75                ~72
LiveCodeBench    67.3            ~65                ~63
IFBench          71.3            —                  —
Output speed     ~1,000 tok/s    ~89 tok/s          ~71 tok/s

This isn't competing with frontier models like Claude Opus or GPT-5 on the hardest reasoning tasks. It's targeting the fast agent tier — where speed matters more than peak intelligence.

Key Features

  • 128K context window — handles large codebases and documents
  • Tunable reasoning — adjust the quality/speed tradeoff per request
  • Native tool use — function calling built in, not bolted on
  • Schema-aligned JSON output — structured output without post-processing
  • OpenAI API compatible — drop-in replacement, no code rewrites needed

Where This Matters: Agentic Workflows

The real impact isn't chat. It's agentic loops where an LLM runs hundreds of iterations:

  • Code generation pipelines — write, test, fix, repeat. At 1,000 tok/s, each iteration takes seconds instead of minutes
  • Multi-step reasoning — chain-of-thought that would take 30 seconds now takes 3
  • Real-time applications — live coding assistants, interactive debugging, instant analysis

A developer on Hacker News proposed "intelligence per second" as the metric that matters: throughput × reasoning quality. Mercury 2 optimizes exactly this.
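Taking the numbers from the benchmark table above, a back-of-the-envelope version of that metric (using the AIME 2025 score as a crude proxy for reasoning quality, and the approximate "~90"/"~88" figures for the comparison models) looks like this:

```python
# Illustrative "intelligence per second": throughput times a normalized
# quality score. The quality proxy and the normalization are arbitrary
# choices for the sake of the comparison, not an established metric.
models = {
    "Mercury 2":        {"tok_per_s": 1000, "aime_2025": 91.1},
    "Claude 4.5 Haiku": {"tok_per_s": 89,   "aime_2025": 90.0},
    "GPT-5 Mini":       {"tok_per_s": 71,   "aime_2025": 88.0},
}

ips = {name: m["tok_per_s"] * m["aime_2025"] / 100 for name, m in models.items()}
for name, score in sorted(ips.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {score:7.1f}")
```

With near-identical quality scores, the ranking is decided almost entirely by throughput, which is the point the metric is meant to make.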

Hybrid Architecture Potential

The most interesting use case discussed in the community: frontier model for planning, diffusion model for execution.

Use Claude Opus or GPT-5 to create a high-level plan, then hand off to Mercury 2 for rapid iteration on individual steps. You get the best reasoning where it matters and maximum speed everywhere else.
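A minimal sketch of that split, with stub functions standing in for the actual API calls (a real version would prompt a frontier model inside `plan_with_frontier` and Mercury 2 inside `execute_with_mercury`; both names are hypothetical):

```python
def plan_with_frontier(task: str) -> list:
    """Stub for a slow, high-quality planning call (e.g. Claude Opus).
    A real implementation would ask the frontier model to decompose
    the task; here we just split on commas."""
    return [part.strip() for part in task.split(",")]

def execute_with_mercury(step: str) -> str:
    """Stub for a fast Mercury 2 call that iterates on a single step."""
    return f"done: {step}"

def run_hybrid(task: str) -> list:
    # Plan once with the expensive model, then fan the steps out to the
    # fast model, where the bulk of the tokens are actually generated.
    return [execute_with_mercury(step) for step in plan_with_frontier(task)]

results = run_hybrid("parse the input, transform the data, write the output")
for line in results:
    print(line)
```

The design point is that planning is a single short call while execution dominates token volume, so the fast model handles the part where throughput pays off most.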

Known Limitations

Mercury 2 is impressive but not without issues flagged by early users:

  • Factual accuracy — parallel generation can produce hallucinations that don't self-correct through the sequence (autoregressive models at least have each token conditioned on all previous ones)
  • Constraint satisfaction — struggles with tasks requiring strict sequential dependencies
  • Not frontier-tier — if you need the absolute best reasoning, you still want Opus or GPT-5

How to Try It

Mercury 2 is available today via the Inception API. It's OpenAI API compatible, so you can point any existing client at it:

from openai import OpenAI

# Point the standard OpenAI client at the Inception endpoint.
client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",
    api_key="your-inception-key",  # replace with your actual key
)

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Explain quantum computing in 3 sentences"}],
)
print(response.choices[0].message.content)

What This Means for the Industry

Diffusion LLMs represent the first serious architectural challenge to the autoregressive paradigm that has dominated since GPT-2. If Mercury 2's approach scales to frontier quality, the entire cost structure of AI inference changes.

At 10× the throughput with comparable quality, inference costs drop dramatically. For businesses running AI at scale — customer support, content generation, code assistance — this could mean 10× more queries for the same GPU budget.

We're watching this space closely. The autoregressive vs. diffusion debate is just getting started.
