
Mercury 2: The First Reasoning Diffusion LLM — 1,000 Tokens/sec

ai.rs Feb 26, 2026

What Is Mercury 2?

Mercury 2 is the first commercial reasoning diffusion LLM from Inception Labs. Unlike every major LLM you've used — GPT, Claude, Llama — Mercury 2 doesn't generate tokens one at a time. It uses diffusion to produce multiple tokens in parallel, then refines them over a small number of steps.

The result: ~1,000 tokens per second output throughput on NVIDIA Blackwell GPUs.

For context, Claude 4.5 Haiku outputs ~89 tok/s and GPT-5 Mini ~71 tok/s. Mercury 2 is roughly 10× faster.

How Diffusion LLMs Work

Traditional LLMs are autoregressive: they predict one token, append it, then predict the next. This is inherently sequential — each token depends on all previous tokens.

Diffusion LLMs take a fundamentally different approach borrowed from image generation (Stable Diffusion, DALL-E):

  1. Start with noise — begin with a block of random tokens
  2. Refine in parallel — iteratively denoise all tokens simultaneously
  3. Converge — after a small number of refinement steps, the output is coherent text

This is called block diffusion. Because tokens are generated in parallel rather than sequentially, GPU utilization skyrockets — you're doing useful compute across all cores simultaneously instead of waiting for one token at a time.

Autoregressive (traditional):
  Token 1 → Token 2 → Token 3 → Token 4 → ...
  [sequential, ~100 tok/s]

Diffusion (Mercury 2):
  [noise] → [rough draft] → [refined] → [final output]
  [parallel, ~1,000 tok/s]
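The refinement loop above can be sketched with a toy simulation. Nothing here is the actual Mercury 2 denoiser; a coin flip stands in for the model's per-token confidence at each step, but it shows why the number of parallel refinement steps stays far smaller than the number of tokens:

```python
import random

random.seed(0)  # deterministic for the demo

MASK = "_"  # placeholder for a not-yet-denoised token
target = "the quick brown fox jumps over the lazy dog".split()

def denoise_step(draft):
    """One refinement step: every masked position resolves in parallel
    with probability 0.5. A real diffusion LLM would instead commit the
    positions its denoiser is most confident about."""
    return [real if tok == MASK and random.random() < 0.5 else tok
            for tok, real in zip(draft, target)]

draft = [MASK] * len(target)  # start from pure "noise"
steps = 0
while MASK in draft:
    draft = denoise_step(draft)
    steps += 1

print(f"{len(target)} tokens converged in {steps} parallel steps")
```

Because each step touches every position at once, the expected step count grows roughly logarithmically with sequence length rather than linearly, which is where the throughput advantage comes from.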

Benchmarks

Mercury 2 is positioned as a fast reasoning model, comparable in quality to Claude 4.5 Haiku and GPT-5 Mini but dramatically faster:

Benchmark        Mercury 2       Claude 4.5 Haiku   GPT-5 Mini
AIME 2025        91.1            ~90                ~88
GPQA             73.6            ~75                ~72
LiveCodeBench    67.3            ~65                ~63
IFBench          71.3            —                  —
Output speed     ~1,000 tok/s    ~89 tok/s          ~71 tok/s

This isn't competing with frontier models like Claude Opus or GPT-5 on the hardest reasoning tasks. It's targeting the fast agent tier — where speed matters more than peak intelligence.

Key Features

  • 128K context window — handles large codebases and documents
  • Tunable reasoning — adjust the quality/speed tradeoff per request
  • Native tool use — function calling built in, not bolted on
  • Schema-aligned JSON output — structured output without post-processing
  • OpenAI API compatible — drop-in replacement, no code rewrites needed

Where This Matters: Agentic Workflows

The real impact isn't chat. It's agentic loops where an LLM runs hundreds of iterations:

  • Code generation pipelines — write, test, fix, repeat. At 1,000 tok/s, each iteration takes seconds instead of minutes
  • Multi-step reasoning — chain-of-thought that would take 30 seconds now takes 3
  • Real-time applications — live coding assistants, interactive debugging, instant analysis

A developer on Hacker News proposed "intelligence per second" as the metric that matters: throughput × reasoning quality. Mercury 2 optimizes exactly this.
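Taking the numbers from the benchmark table above, a back-of-the-envelope version of that metric (using the AIME 2025 score as a crude proxy for reasoning quality, and the approximate "~90"/"~88" figures for the comparison models) looks like this:

```python
# Illustrative "intelligence per second": throughput times a normalized
# quality score. The quality proxy and the normalization are arbitrary
# choices for the sake of the comparison, not an established metric.
models = {
    "Mercury 2":        {"tok_per_s": 1000, "aime_2025": 91.1},
    "Claude 4.5 Haiku": {"tok_per_s": 89,   "aime_2025": 90.0},
    "GPT-5 Mini":       {"tok_per_s": 71,   "aime_2025": 88.0},
}

ips = {name: m["tok_per_s"] * m["aime_2025"] / 100 for name, m in models.items()}
for name, score in sorted(ips.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {score:7.1f}")
```

With near-identical quality scores, the ranking is decided almost entirely by throughput, which is the point the metric is meant to make.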

Hybrid Architecture Potential

The most interesting use case discussed in the community: frontier model for planning, diffusion model for execution.

Use Claude Opus or GPT-5 to create a high-level plan, then hand off to Mercury 2 for rapid iteration on individual steps. You get the best reasoning where it matters and maximum speed everywhere else.
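A minimal sketch of that split, with stub functions standing in for the actual API calls (a real version would prompt a frontier model inside `plan_with_frontier` and Mercury 2 inside `execute_with_mercury`; both names are hypothetical):

```python
def plan_with_frontier(task: str) -> list:
    """Stub for a slow, high-quality planning call (e.g. Claude Opus).
    A real implementation would ask the frontier model to decompose
    the task; here we just split on commas."""
    return [part.strip() for part in task.split(",")]

def execute_with_mercury(step: str) -> str:
    """Stub for a fast Mercury 2 call that iterates on a single step."""
    return f"done: {step}"

def run_hybrid(task: str) -> list:
    # Plan once with the expensive model, then fan the steps out to the
    # fast model, where the bulk of the tokens are actually generated.
    return [execute_with_mercury(step) for step in plan_with_frontier(task)]

results = run_hybrid("parse the input, transform the data, write the output")
for line in results:
    print(line)
```

The design point is that planning is a single short call while execution dominates token volume, so the fast model handles the part where throughput pays off most.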

Known Limitations

Mercury 2 is impressive but not without issues flagged by early users:

  • Factual accuracy — parallel generation can produce hallucinations that don't self-correct through the sequence (autoregressive models at least have each token conditioned on all previous ones)
  • Constraint satisfaction — struggles with tasks requiring strict sequential dependencies
  • Not frontier-tier — if you need the absolute best reasoning, you still want Opus or GPT-5

How to Try It

Mercury 2 is available today via the Inception API. It's OpenAI API compatible, so you can point any existing client at it:

from openai import OpenAI

# Point the standard OpenAI client at the Inception endpoint.
client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",
    api_key="your-inception-key",  # replace with your actual key
)

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Explain quantum computing in 3 sentences"}],
)
print(response.choices[0].message.content)

What This Means for the Industry

Diffusion LLMs represent the first serious architectural challenge to the autoregressive paradigm that has dominated since GPT-2. If Mercury 2's approach scales to frontier quality, the entire cost structure of AI inference changes.

At 10× the throughput with comparable quality, inference costs drop dramatically. For businesses running AI at scale — customer support, content generation, code assistance — this could mean 10× more queries for the same GPU budget.

We're watching this space closely. The autoregressive vs. diffusion debate is just getting started.
