
Mercury 2: The First Reasoning Diffusion LLM — 1,000 Tokens/sec

ai.rs Feb 26, 2026

What Is Mercury 2?

Mercury 2 is the first commercial reasoning diffusion LLM from Inception Labs. Unlike every major LLM you've used — GPT, Claude, Llama — Mercury 2 doesn't generate tokens one at a time. It uses diffusion to produce multiple tokens in parallel, then refines them over a small number of steps.

The result: ~1,000 tokens per second output throughput on NVIDIA Blackwell GPUs.

For context, Claude 4.5 Haiku outputs ~89 tok/s and GPT-5 Mini ~71 tok/s. Mercury 2 is roughly 10× faster.

How Diffusion LLMs Work

Traditional LLMs are autoregressive: they predict one token, append it, then predict the next. This is inherently sequential — each token depends on all previous tokens.

Diffusion LLMs take a fundamentally different approach borrowed from image generation (Stable Diffusion, DALL-E):

  1. Start with noise — begin with a block of random tokens
  2. Refine in parallel — iteratively denoise all tokens simultaneously
  3. Converge — after a small number of refinement steps, the output is coherent text

This is called block diffusion. Because tokens are generated in parallel rather than sequentially, GPU utilization skyrockets — you're doing useful compute across all cores simultaneously instead of waiting for one token at a time.

Autoregressive (traditional):
  Token 1 → Token 2 → Token 3 → Token 4 → ...
  [sequential, ~100 tok/s]

Diffusion (Mercury 2):
  [noise] → [rough draft] → [refined] → [final output]
  [parallel, ~1,000 tok/s]
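
To make that loop concrete, here is a toy sketch of confidence-based parallel unmasking, one common way this kind of decoder is built. Everything in it (the vocabulary, the random "denoiser," the commit schedule) is invented for illustration; it is not Inception's actual implementation.

import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
MASK = "<mask>"

def toy_denoiser(block):
    # Stand-in for the real model: propose a (token, confidence)
    # pair for every position in one parallel pass.
    return [(random.choice(VOCAB), random.random()) for _ in block]

def block_diffusion_decode(block_size=8, steps=4):
    block = [MASK] * block_size              # step 1: start from noise
    per_step = max(1, block_size // steps)
    for _ in range(steps):                   # step 2: refine in parallel
        proposals = toy_denoiser(block)
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        # Commit only the most confident masked positions this round.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = proposals[i][0]
    # step 3: converge, filling any positions still masked
    proposals = toy_denoiser(block)
    return [tok if tok != MASK else proposals[i][0]
            for i, tok in enumerate(block)]

print(" ".join(block_diffusion_decode()))

The key property is that toy_denoiser scores every position at once, so each refinement step is one parallel pass over the whole block rather than one token at a time.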

Benchmarks

Inception positions Mercury 2 as a fast reasoning model: comparable to Claude 4.5 Haiku and GPT-5 Mini in quality, but dramatically faster:

Benchmark        Mercury 2      Claude 4.5 Haiku   GPT-5 Mini
AIME 2025        91.1           ~90                ~88
GPQA             73.6           ~75                ~72
LiveCodeBench    67.3           ~65                ~63
IFBench          71.3           —                  —
Output speed     ~1,000 tok/s   ~89 tok/s          ~71 tok/s

This isn't competing with frontier models like Claude Opus or GPT-5 on the hardest reasoning tasks. It's targeting the fast agent tier — where speed matters more than peak intelligence.

Key Features

  • 128K context window — handles large codebases and documents
  • Tunable reasoning — adjust the quality/speed tradeoff per request
  • Native tool use — function calling built in, not bolted on
  • Schema-aligned JSON output — structured output without post-processing
  • OpenAI API compatible — drop-in replacement, no code rewrites needed

Where This Matters: Agentic Workflows

The real impact isn't chat. It's agentic loops where an LLM runs hundreds of iterations:

  • Code generation pipelines — write, test, fix, repeat. At 1,000 tok/s, each iteration takes seconds instead of minutes (see the sketch after this list)
  • Multi-step reasoning — chain-of-thought that would take 30 seconds now takes 3
  • Real-time applications — live coding assistants, interactive debugging, instant analysis
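
Here is the skeleton of such a write-test-fix loop. The endpoint and model name come from the "How to Try It" section below; the loop structure, file names, and test command are our own illustrative assumptions, not a documented pipeline.

import subprocess
from openai import OpenAI

client = OpenAI(base_url="https://api.inceptionlabs.ai/v1",
                api_key="your-inception-key")

prompt = "Write fizzbuzz(n) in Python. Reply with code only, no fences."
for attempt in range(5):  # real pipelines may run hundreds of iterations
    code = client.chat.completions.create(
        model="mercury-2",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    with open("solution.py", "w") as f:
        f.write(code)
    # Assumes a tests/ directory with pytest tests for the task.
    result = subprocess.run(["python", "-m", "pytest", "tests/"],
                            capture_output=True, text=True)
    if result.returncode == 0:  # tests pass, we're done
        break
    # Feed the failure output back into the prompt and try again.
    prompt += "\n\nYour last attempt failed these tests:\n" + result.stdout[-2000:]

At autoregressive speeds, the model call dominates each iteration; at ~1,000 tok/s, the test run does.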

A developer on Hacker News proposed "intelligence per second" as the metric that matters: throughput × reasoning quality. Mercury 2 optimizes exactly this.
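
As a rough back-of-the-envelope version of that idea, using the numbers from the table above and treating the GPQA score as a crude quality proxy (our choice of proxy, not the commenter's definition):

# Crude "intelligence per second": throughput x quality proxy.
# GPQA-as-quality is an illustrative assumption, not a standard metric.
models = {
    "Mercury 2":        (1000, 0.736),
    "Claude 4.5 Haiku": (89,   0.75),
    "GPT-5 Mini":       (71,   0.72),
}
for name, (tok_per_s, quality) in models.items():
    print(f"{name:<18} {tok_per_s * quality:7.1f}")
# Mercury 2            736.0
# Claude 4.5 Haiku      66.8
# GPT-5 Mini            51.1

By this crude measure Mercury 2 scores roughly 10× higher, which is the speed gap restated, since the quality scores are nearly equal.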

Hybrid Architecture Potential

The most interesting use case discussed in the community: frontier model for planning, diffusion model for execution.

Use Claude Opus or GPT-5 to create a high-level plan, then hand off to Mercury 2 for rapid iteration on individual steps. You get the best reasoning where it matters and maximum speed everywhere else.
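
A minimal sketch of that split, assuming both sides speak the OpenAI chat-completions API. The planner model name and the newline-per-step plan format are placeholders for illustration, not a documented integration:

from openai import OpenAI

# Planner: a frontier model via the standard OpenAI endpoint (placeholder).
planner = OpenAI()
# Executor: Mercury 2 via Inception's OpenAI-compatible endpoint.
executor = OpenAI(base_url="https://api.inceptionlabs.ai/v1",
                  api_key="your-inception-key")

plan = planner.chat.completions.create(
    model="gpt-5",  # placeholder frontier model name
    messages=[{"role": "user", "content":
               "Plan 'add retry logic to our HTTP client' as numbered steps, one per line."}],
).choices[0].message.content

# Hand each step to the fast executor for rapid iteration.
for step in filter(str.strip, plan.splitlines()):
    result = executor.chat.completions.create(
        model="mercury-2",
        messages=[{"role": "user", "content": f"Implement this step:\n{step}"}],
    )
    print(result.choices[0].message.content)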

Known Limitations

Mercury 2 is impressive but not without issues flagged by early users:

  • Factual accuracy — parallel generation can produce hallucinations that don't self-correct through the sequence (autoregressive models at least have each token conditioned on all previous ones)
  • Constraint satisfaction — struggles with tasks requiring strict sequential dependencies
  • Not frontier-tier — if you need the absolute best reasoning, you still want Opus or GPT-5

How to Try It

Mercury 2 is available today via the Inception API. It's OpenAI API compatible, so you can point any existing client at it:

from openai import OpenAI

# Point the standard OpenAI client at Inception's endpoint.
client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",
    api_key="your-inception-key"  # replace with your Inception API key
)

# A standard chat-completions call; only the model name is different.
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Explain quantum computing in 3 sentences"}]
)
print(response.choices[0].message.content)
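
The schema-aligned JSON output from the feature list should be reachable through the same client, assuming the endpoint honors OpenAI's response_format parameter (that is our assumption; check Inception's docs for the exact mechanism):

# Assumes the endpoint supports OpenAI-style JSON mode; verify in
# Inception's docs before relying on it.
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content":
               "Return JSON with keys 'language' and 'reason' recommending a first programming language."}],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)  # a parseable JSON string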

What This Means for the Industry

Diffusion LLMs represent the first serious architectural challenge to the autoregressive paradigm that has dominated since GPT-2. If Mercury 2's approach scales to frontier quality, the entire cost structure of AI inference changes.

At 10× the throughput with comparable quality, inference costs drop dramatically. For businesses running AI at scale — customer support, content generation, code assistance — this could mean 10× more queries for the same GPU budget.

We're watching this space closely. The autoregressive vs. diffusion debate is just getting started.
