Inception Labs launched Mercury 2 on February 24, claiming it's the fastest reasoning LLM available — a diffusion language model that generates text at 1,196 tokens per second, 5-10x faster than speed-optimized models like GPT-4.1 Nano and Claude 3.5 Haiku. At $0.25 per million input tokens, it's also among the cheapest.
We put those claims to the test.
The Pitch: Diffusion, Not Autoregressive
Every major LLM today — GPT, Claude, Llama, Gemini — is autoregressive: it generates tokens one at a time, left to right, each depending on all previous tokens. Mercury 2 takes a fundamentally different approach. Like Stable Diffusion for images, it starts with noise and iteratively refines all tokens in parallel.
The result, in theory: massively parallel generation that breaks the sequential bottleneck.
| | Autoregressive (GPT, Claude) | Diffusion (Mercury 2) |
|---|---|---|
| Generation | Sequential, token-by-token | Parallel, all-at-once |
| Time to first token (TTFT) | Fast (200-400ms) | Slower (700ms+) |
| Throughput | Bounded by sequential nature | Scales with parallelism |
| Cost scaling | Linear with output length | Sub-linear potential |
| Sweet spot | Interactive chat, reasoning | Batch, pipelines, agents |
Getting Started: Two Lines of Change
Mercury 2 is fully OpenAI API-compatible. If you already use the OpenAI Python SDK, switching takes exactly two changes — the base URL and the API key:
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["INCEPTION_API_KEY"],
    base_url="https://api.inceptionlabs.ai/v1",
)
```
That's it. Every client.chat.completions.create() call works the same as with OpenAI. No new SDK, no wrapper library, no config files. You can also use LiteLLM, AISuite, or LangChain's ChatOpenAI with a custom base_url.
Test 1: Can It Talk?
We started simple — ask it to explain itself:
```python
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Explain diffusion language models in 2 sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```
Response:
Diffusion language models generate text by iteratively denoising a noisy token sequence, much like diffusion models for images, allowing many tokens to be produced in parallel rather than one-by-one. This parallel generation makes them several times faster and less than half as costly as traditional auto-regressive LLMs while also enabling fine-grained control over schema and multimodal integration.
75 tokens in 0.64 seconds. Clean, accurate, well-structured. No hallucinations. But 117 tok/s is a far cry from the advertised 1,196. On short outputs, network round-trip dominates — the model finishes generating before the response even reaches you.
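Every client-side speed figure in this piece is computed the same way: completion tokens divided by wall-clock time. A trivial helper (our own, not part of any SDK) makes the arithmetic explicit:

```python
def tokens_per_second(completion_tokens: int, wall_seconds: float) -> float:
    """Client-side throughput: completion tokens over wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall time must be positive")
    return completion_tokens / wall_seconds

# The short-output run above: 75 tokens in 0.64 s
print(round(tokens_per_second(75, 0.64)))  # → 117
```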
Test 2: Pushing Throughput
To see real speed, you need to request longer outputs. We asked for a detailed Flask tutorial with max_tokens=1024:
```python
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Write a detailed technical tutorial about building "
                                          "a REST API with Python Flask. Cover routing, error handling, "
                                          "database integration, authentication, and deployment."}],
    max_tokens=1024,
)
```
| Metric | Value |
|---|---|
| Completion tokens | 866 |
| Wall time | 1.750s |
| Throughput | 495 tok/s |
866 tokens in under two seconds. The model hit the token limit and was still going — it had more to say. At 495 tok/s end-to-end from a consumer internet connection, this is already several times faster than what you'd get from GPT-4o or Claude Sonnet.
Test 3: Streaming — Where the Speed Really Shows
Streaming reveals how diffusion models behave differently. With autoregressive models, tokens trickle in one by one — you see the response being "typed out." With Mercury 2, there's a longer pause, then tokens arrive in bursts:
```python
stream = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Write a comprehensive guide to Python "
                                          "decorators with 5 examples."}],
    max_tokens=1024,
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
| Metric | Value |
|---|---|
| Completion tokens | 900 |
| TTFT (time to first token) | 741ms |
| Generation phase | 1.614s |
| Generation speed (excl TTFT) | 558 tok/s |
| End-to-end speed | 382 tok/s |
Here's the key insight: 558 tok/s during the generation phase. The 741ms time-to-first-token is higher than autoregressive models (which typically start streaming in 200-400ms), but that's because Mercury 2 does its "thinking" upfront — denoising all tokens in parallel — before emitting anything.
We received only 31 chunks for 900 tokens, meaning the API batches roughly 29 tokens per chunk. You don't see a character-by-character typewriter effect; you see paragraphs appearing in rapid bursts.
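These streaming metrics fall out of simple timestamp arithmetic. If you record `time.monotonic()` at request start and at each chunk arrival, a helper like this (our own sketch, assuming at least two chunks) reproduces TTFT, generation-phase speed, and the batching factor:

```python
def stream_stats(t_start, chunk_times, total_tokens):
    """Derive streaming metrics from chunk arrival timestamps.

    chunk_times: monotonic timestamps, one per received chunk (>= 2 entries).
    """
    ttft = chunk_times[0] - t_start
    gen_phase = chunk_times[-1] - chunk_times[0]
    return {
        "ttft_s": ttft,
        "gen_tok_per_s": total_tokens / gen_phase,
        "end_to_end_tok_per_s": total_tokens / (chunk_times[-1] - t_start),
        "avg_tokens_per_chunk": total_tokens / len(chunk_times),
    }
```

Feeding in our run (31 chunks, first at 741 ms, last 1.614 s later, 900 tokens) recovers the table's 558 tok/s generation speed and ~29 tokens per chunk.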
Test 4: Tool Use
Function calling is table-stakes for agentic applications. We defined a weather tool and asked about Belgrade:
```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What's the weather in Belgrade?"}],
    tools=tools,
    max_tokens=200,
)

for tc in response.choices[0].message.tool_calls:
    print(f"{tc.function.name}({tc.function.arguments})")
```
Output:
```
get_weather({
  "location": "Belgrade",
  "unit": "celsius"
})
```
Correct function, correct arguments, and it even inferred celsius for a European city. Finished in 0.678s with finish_reason: tool_calls. This works exactly as you'd expect from the OpenAI API — no surprises, no adaptation needed.
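The model only proposes the call; your application still has to execute it. A minimal dispatcher — our own pattern, with a stand-in `get_weather` body where a real app would hit a weather API — looks like this:

```python
import json

def get_weather(location: str, unit: str = "celsius") -> str:
    # Stand-in implementation; a real app would call a weather service here.
    return f"22°{unit[0].upper()} and sunny in {location}"

# Map tool names the model may request to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}

def dispatch(tool_call) -> str:
    """Route an OpenAI-style tool call to the matching local function."""
    fn = TOOL_REGISTRY[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)
    return fn(**args)
```

You would then append the dispatcher's return value to `messages` as a `"role": "tool"` entry and call the model again.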
Test 5: Structured Output
JSON mode is critical for production pipelines. We tested with response_format={"type": "json_object"}:
```python
import json

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{
        "role": "user",
        "content": 'List 3 programming languages with their year of creation. '
                   'Return as a JSON object with a "languages" key containing '
                   'an array of objects with "name" and "year" fields.',
    }],
    response_format={"type": "json_object"},
    max_tokens=300,
)

parsed = json.loads(response.choices[0].message.content)
print(json.dumps(parsed, indent=2))
```
Output:
```json
{
  "languages": [
    { "name": "C", "year": 1972 },
    { "name": "Python", "year": 1991 },
    { "name": "JavaScript", "year": 1995 }
  ]
}
```
Valid JSON, correct schema, accurate facts. Parsed without errors. For production use, you'd want to test with more complex schemas, but the basics are solid.
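In production you would validate with jsonschema or Pydantic; for this particular response shape, even a hand-rolled check (ours, illustrative only) catches malformed output before it reaches your pipeline:

```python
def check_languages_payload(payload: dict) -> bool:
    """Minimal structural check for the schema requested above:
    {"languages": [{"name": str, "year": int}, ...]}"""
    langs = payload.get("languages")
    if not isinstance(langs, list) or not langs:
        return False
    return all(
        isinstance(item, dict)
        and isinstance(item.get("name"), str)
        and isinstance(item.get("year"), int)
        for item in langs
    )
```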
Test 6: Speed Consistency
We ran the same prompt three times to check for variance:
| Run | Tokens | Time | Speed |
|---|---|---|---|
| 1 | 308 | 1.189s | 259 tok/s |
| 2 | 262 | 1.090s | 240 tok/s |
| 3 | 286 | 0.902s | 317 tok/s |
| Average | — | — | 272 tok/s |
| Peak | — | — | 317 tok/s |
Variance of 240–317 tok/s is acceptable. Differences come from network jitter, server load, and the model using different numbers of diffusion steps depending on output complexity.
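The summary rows are straightforward to recompute from the raw runs:

```python
runs = [(308, 1.189), (262, 1.090), (286, 0.902)]  # (tokens, seconds) from the table

speeds = [tokens / seconds for tokens, seconds in runs]
average = sum(speeds) / len(speeds)
peak = max(speeds)
print(f"average {average:.0f} tok/s, peak {peak:.0f} tok/s")  # → average 272 tok/s, peak 317 tok/s
```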
The Speed Gap: Advertised vs. Measured
| Measurement | Speed | Notes |
|---|---|---|
| Inception's benchmark | 1,196 tok/s | Server-side, no network |
| Our best (streaming, generation only) | 558 tok/s | Excludes TTFT |
| Our best (non-streaming, end-to-end) | 495 tok/s | Large output |
| Multi-run average | 272 tok/s | Medium output |
| Short output | 117 tok/s | Network dominates |
We measured roughly half the advertised speed. That's not a knock on Mercury 2 — it's physics. Our tests ran from a consumer internet connection through the public API. The 1,196 tok/s figure is server-side throughput measured at the inference layer, before network overhead, TLS, HTTP framing, and Python SDK parsing eat into it.
To match their number, you'd need to benchmark from co-located infrastructure (same cloud region) or measure at the GPU layer. For what it's worth, 558 tok/s over the public internet is genuinely fast — most autoregressive models top out at 50-150 tok/s in comparable conditions.
How Does It Compare? Price & Speed
Speed only matters in context. Mercury 2 competes in the "fast and cheap" tier — models you'd use for high-volume pipelines, agents, and latency-sensitive applications, not frontier reasoning. Here's how it stacks up:
Pricing Comparison
| Model | Input $/M | Output $/M | Context | Architecture |
|---|---|---|---|---|
| Mercury 2 | $0.25 | $1.00 | 128K | Diffusion |
| DeepSeek V3 | $0.28 | $0.42 | 128K | Autoregressive (MoE) |
| GPT-4.1 Nano | $0.10 | $0.40 | 1M | Autoregressive |
| GPT-4o-mini | $0.15 | $0.60 | 128K | Autoregressive |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M | Autoregressive |
| Claude 3.5 Haiku | $0.80 | $4.00 | 200K | Autoregressive |
| GPT-4o | $2.50 | $10.00 | 128K | Autoregressive |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K | Autoregressive |
On input pricing, Mercury 2 is mid-pack — GPT-4.1 Nano and Gemini 2.0 Flash are cheaper at $0.10/M. On output, it's $1.00/M — more expensive than DeepSeek ($0.42) and GPT-4.1 Nano ($0.40), but far cheaper than Claude Haiku ($4.00) or any mid-tier model.
The real cost story is output-heavy workloads. If you're generating long responses (agents, code generation, content pipelines), output pricing dominates. At $1.00/M output, Mercury 2 costs:
- 2.5x more than GPT-4.1 Nano
- 2.5x more than Gemini 2.0 Flash
- 4x less than Claude 3.5 Haiku
- 10x less than GPT-4o
- 15x less than Claude Sonnet
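Those multiples come straight from the output prices in the table; a small lookup (prices as listed above, current as of our testing) keeps the comparison honest:

```python
OUTPUT_PRICE_PER_M = {  # USD per million output tokens, from the table above
    "mercury-2": 1.00,
    "gpt-4.1-nano": 0.40,
    "gemini-2.0-flash": 0.40,
    "claude-3.5-haiku": 4.00,
    "gpt-4o": 10.00,
    "claude-sonnet-4.6": 15.00,
}

def cost_multiple(model_a: str, model_b: str) -> float:
    """How many times more expensive model_a's output is than model_b's."""
    return OUTPUT_PRICE_PER_M[model_a] / OUTPUT_PRICE_PER_M[model_b]

print(cost_multiple("mercury-2", "gpt-4.1-nano"))      # → 2.5
print(cost_multiple("claude-3.5-haiku", "mercury-2"))  # → 4.0
```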
Speed Comparison
| Model | Approx. Speed (tok/s) | Notes |
|---|---|---|
| Mercury 2 | 495–558 (measured) | Diffusion; 1,196 server-side |
| Gemini 2.0 Flash | ~250 | Google's speed tier |
| DeepSeek V3 | ~100–160 | Varies by load |
| GPT-4o-mini | ~100–130 | OpenAI speed tier |
| GPT-4.1 Nano | ~150–200 | OpenAI's fastest |
| Claude 3.5 Haiku | ~80–100 | Anthropic speed tier |
| GPT-4o | ~60–90 | Mid-tier |
| Claude Sonnet 4.6 | ~70–80 | Mid-tier |
Speed figures are approximate client-side measurements and vary by network, region, and load. Mercury 2 figures are from our testing.
Even through the public internet, Mercury 2 is 2-3x faster than the next fastest competitor (Gemini Flash at ~250 tok/s) and 5-7x faster than mid-tier models like GPT-4o and Claude Sonnet. This is where the diffusion architecture genuinely shines — it's not marketing fluff.
Cost per Million Output Tokens at Speed
A useful way to think about it: what do you pay per million output tokens, and how fast do you get them?
| Model | Output $/M | Speed (tok/s) | Time for 1M tokens | Cost per hour of output |
|---|---|---|---|---|
| Mercury 2 | $1.00 | 550 | ~30 min | ~$2.00 |
| GPT-4.1 Nano | $0.40 | 175 | ~95 min | ~$0.25 |
| DeepSeek V3 | $0.42 | 130 | ~128 min | ~$0.20 |
| Gemini 2.0 Flash | $0.40 | 250 | ~67 min | ~$0.36 |
| Claude 3.5 Haiku | $4.00 | 90 | ~185 min | ~$1.30 |
Mercury 2 isn't the cheapest per token, but it delivers those tokens fastest. If your bottleneck is latency — how quickly you can complete an agentic loop, respond to a user, or process a document — Mercury 2 wins decisively. If your bottleneck is pure cost and you can tolerate slower speeds, DeepSeek V3 or GPT-4.1 Nano are cheaper.
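The last two columns of the table follow from the first two, assuming sustained generation at the quoted speed:

```python
def time_for_tokens_min(tokens: int, tok_per_s: float) -> float:
    """Minutes to generate a given number of tokens at a sustained rate."""
    return tokens / tok_per_s / 60

def cost_per_hour(output_price_per_m: float, tok_per_s: float) -> float:
    """Dollars spent per hour of continuous generation at a sustained rate."""
    tokens_per_hour = tok_per_s * 3600
    return tokens_per_hour / 1_000_000 * output_price_per_m

# Mercury 2 at $1.00/M output and ~550 tok/s:
print(round(time_for_tokens_min(1_000_000, 550)))  # → 30
print(round(cost_per_hour(1.00, 550), 2))          # → 1.98
```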
Beyond Speed: Extended Testing
We ran a second test suite covering reasoning, multi-language, multi-turn conversation, agentic tool chains, needle-in-a-haystack retrieval, and edge cases. The results surfaced both strengths and a critical quirk.
The max_tokens Trap
The most important practical finding: Mercury 2 needs generous max_tokens values or it returns empty responses.
With autoregressive models, setting max_tokens=20 means "generate up to 20 tokens, stop when you're done." The model emits tokens one by one and stops early if it finishes. Mercury 2's diffusion architecture works differently — it appears to allocate the full output buffer upfront. If that buffer is too small, the model produces empty content with finish_reason=length and tokens=0:
```python
# This fails silently — returns an empty string
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=10,  # too low for a diffusion model
)
print(response.choices[0].message.content)  # ""

# This works — give it room
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=150,  # generous headroom
)
print(response.choices[0].message.content)  # "4"
```
Rule of thumb: always set max_tokens to at least 100–200, even if you expect a short answer. The model will still stop early (finish_reason=stop) when it's done — you won't waste tokens. But if you set it too low, you get nothing. This is a significant difference from autoregressive models and will bite you in production if you're migrating existing code.
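A defensive wrapper is the easiest migration aid. This sketch (our own pattern, not an official SDK feature) enforces a token floor and retries once with a larger budget when it detects the empty-response signature:

```python
def safe_complete(client, model, messages, max_tokens=200,
                  floor=200, retry_factor=4, **kwargs):
    """Call chat.completions.create with diffusion-safe max_tokens.

    Enforces a minimum token budget and, if the response comes back empty
    with finish_reason == "length", retries once with a larger budget.
    Our own defensive pattern for the quirk described above.
    """
    budget = max(max_tokens, floor)
    response = client.chat.completions.create(
        model=model, messages=messages, max_tokens=budget, **kwargs
    )
    choice = response.choices[0]
    if not choice.message.content and choice.finish_reason == "length":
        response = client.chat.completions.create(
            model=model, messages=messages, max_tokens=budget * retry_factor, **kwargs
        )
    return response
```

Because the model stops early with `finish_reason=stop` when done, the higher budget costs nothing on well-behaved responses.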
Quality & Reasoning: 5/6
| Test | Result | Details |
|---|---|---|
| Arithmetic (17×23+14−5) | PASS | Returned 400 correctly |
| Word problem (45−12+8) | PASS | Returned 41 correctly |
| Logic (invalid syllogism) | FAIL | Empty response (max_tokens issue) |
| Code generation (fibonacci) | PASS | Clean Python function, 182 chars |
| Instruction following (3 bullets) | PASS | Exactly 3 dash-prefixed bullets |
| Factual recall (capital of Australia) | PASS | Canberra |
Reasoning is solid. The one failure was the logic test returning empty — likely another max_tokens casualty at 80 tokens, not a reasoning failure. Math, code, instructions, and factual recall all pass cleanly.
Multi-language: 3/3
| Language | Prompt | Response |
|---|---|---|
| Serbian | "Koji je glavni grad Srbije?" | Beograd |
| German | "Was ist die Hauptstadt von Deutschland?" | Berlin |
| Japanese | "日本の首都はどこですか?" | 東京 |
Mercury 2 handles non-English prompts correctly — including Cyrillic-adjacent and CJK languages. Responses are accurate and concise.
Multi-turn Conversation: 3/3
We tested whether the model maintains context across turns:
```python
messages = [
    {"role": "system", "content": "You are a helpful assistant. Be concise."},
    {"role": "user", "content": "My name is Marko and I live in Novi Sad."},
]
# ... assistant responds ...

messages.append({"role": "user", "content": "What is my name?"})
# → "Your name is Marko."

messages.append({"role": "user", "content": "Where do I live?"})
# → "You live in Novi Sad."
```
Both facts recalled correctly. We also tested persona consistency by assigning a pirate persona — Mercury 2 committed fully ("Arr, matey! Gather 'round the galley o' knowledge...") with 7 pirate-themed words in a single response.
Agentic Tool Chains: 4/4
This was the most impressive result. We defined three tools (search_product, get_price, add_to_cart) and asked Mercury 2 to find a blue t-shirt and add it to a cart:
```
Step 1: User asks "Find me a blue t-shirt and add it to my cart."
        → Model calls search_product(query="blue t-shirt") ✓
Step 2: We return search results with SKU-1234
        → Model calls add_to_cart(product_id="SKU-1234") ✓
Step 3: We confirm the cart addition
        → Model responds: "Your blue t-shirt has been added to your cart.
          Let me know if you'd like anything else." ✓
Step 4: We return a tool error ("Service temporarily unavailable")
        → Model retries the tool call ✓
```
Four steps, four correct decisions. The model understood the task, chained tools in the right order, confirmed success in natural language, and recovered from an error by retrying. This validates Inception's pitch that Mercury 2 is built for agentic workloads.
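The loop we drove by hand generalizes to a few lines. This is our own sketch of a generic tool loop — the `impls` mapping of tool names to local callables is our convention, not part of the API:

```python
import json

def run_tool_loop(client, model, user_prompt, tools, impls, max_steps=6):
    """Minimal agent loop (our sketch, not Inception's code): call the model,
    execute any requested tools via `impls`, feed results back, and stop
    when the model answers in plain text."""
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model=model, messages=messages, tools=tools, max_tokens=500
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final natural-language answer
        messages.append(msg)  # the OpenAI SDK accepts message objects here
        for tc in msg.tool_calls:
            result = impls[tc.function.name](**json.loads(tc.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": json.dumps(result),
            })
    raise RuntimeError("tool loop did not finish within max_steps")
```

Production code would also handle malformed arguments and tool errors; Mercury 2's retry behavior in Step 4 suggests it cooperates with error results fed back as tool messages.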
Needle in a Haystack: 3/3
We hid the string MERCURY-FAST-7742 inside ~4,000 tokens of filler text at three positions:
| Position | Found? |
|---|---|
| Beginning | MERCURY-FAST-7742 |
| Middle | MERCURY-FAST-7742 |
| End | MERCURY-FAST-7742 |
Perfect retrieval at all positions. The 128K context window handles information retrieval correctly — at least at the ~4K scale we tested.
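The harness itself is simple; our filler construction looked roughly like this (a sketch, with an illustrative filler sentence):

```python
def build_haystack(needle, filler_sentence, n_filler, position):
    """Embed a needle at the beginning, middle, or end of repeated filler text."""
    filler = [filler_sentence] * n_filler
    index = {"beginning": 0, "middle": n_filler // 2, "end": n_filler}[position]
    parts = filler[:index] + [needle] + filler[index:]
    return " ".join(parts)

prompt = build_haystack(
    "MERCURY-FAST-7742",
    "The quarterly report noted steady progress across all departments.",
    400,  # roughly 4,000 tokens of filler at ~10 tokens per sentence
    "middle",
)
```

The model was then asked to return the hidden string verbatim.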
What Didn't Work
Concurrency returned 0 tokens across all 20 requests — the max_tokens=20 setting caused every parallel request to fail silently. This needs retesting with proper headroom.
Temperature sampling also produced empty responses at max_tokens=10. We couldn't evaluate whether the diffusion model's temperature parameter behaves differently from autoregressive models — an open question for future testing.
System prompt adherence with very constrained instructions ("respond in exactly 3 words") returned empty at max_tokens=20. With adequate headroom, Mercury 2 follows system prompts well (the pirate test proves this), but the tight token budget prevented evaluation of precise constraint-following.
Extended Test Summary
| Suite | Score | Verdict |
|---|---|---|
| Reasoning | 5/6 | Strong — math, code, facts, instructions |
| Multi-language | 3/3 | Serbian, German, Japanese all correct |
| Multi-turn | 3/3 | Memory and persona consistency |
| Agentic Loops | 4/4 | Multi-step tool chains + error recovery |
| Needle-in-Haystack | 3/3 | Perfect retrieval at all positions |
| Edge Cases | 3/5 | JSON and stop sequences work; system prompt tests need more headroom |
| Concurrency | — | Inconclusive (max_tokens issue) |
| Sampling | — | Inconclusive (max_tokens issue) |
| Total | 21/24 | — |
The Bottom Line
Mercury 2 is real and it works. The OpenAI compatibility is seamless — you can swap it into an existing codebase in under a minute. Tool use, streaming, JSON mode, multi-language, multi-turn, and agentic tool chains all function correctly. It scored 21/24 on our extended test suite, with the three failures all traceable to a single cause: the diffusion model's sensitivity to low max_tokens values.
That max_tokens quirk is the one thing you must know before deploying. Set it generously (200+) even for short expected outputs, or you'll get silent empty responses. It's not a bug per se — it's how diffusion generation works — but it will catch anyone migrating from autoregressive models.
The speed advantage is genuine, though tempered by network reality. You won't see 1,196 tok/s from your laptop, but 400-550 tok/s is still 2-3x faster than the next fastest alternative. The agentic capabilities are particularly strong — multi-step tool chains with error recovery worked flawlessly, validating Inception's core pitch.
It's not the cheapest model per token (GPT-4.1 Nano and DeepSeek V3 undercut it on output pricing), and it's not the smartest (frontier models like Claude Sonnet or GPT-4o have deeper reasoning). But in the speed-to-cost ratio for production workloads, Mercury 2 occupies a unique position — and as the first commercial diffusion LLM, it represents a genuine architectural bet that the rest of the industry is watching.
Specs at a Glance:

| Spec | Value |
|---|---|
| Model | mercury-2 |
| Architecture | Diffusion LLM (dLLM) |
| Context window | 128K tokens |
| Max completion | 16,384 tokens |
| Input pricing | $0.25/M tokens |
| Output pricing | $1.00/M tokens |
| API compatibility | OpenAI-compatible |
| Measured throughput | 495–558 tok/s (client-side) |