
Mercury 2: Hands-On With the World's Fastest Reasoning LLM

ai.rs Feb 28, 2026

Inception Labs launched Mercury 2 on February 24, claiming it's the fastest reasoning LLM available — a diffusion language model that generates text at 1,196 tokens per second, 5-10x faster than speed-optimized models like GPT-4.1 Nano and Claude 3.5 Haiku. At $0.25 per million input tokens, it's also among the cheapest.

We put those claims to the test.


The Pitch: Diffusion, Not Autoregressive

Every major LLM today — GPT, Claude, Llama, Gemini — is autoregressive: it generates tokens one at a time, left to right, each depending on all previous tokens. Mercury 2 takes a fundamentally different approach. Like Stable Diffusion for images, it starts with noise and iteratively refines all tokens in parallel.

The result, in theory: massively parallel generation that breaks the sequential bottleneck.

Autoregressive (GPT, Claude) Diffusion (Mercury 2)
Generation Sequential, token-by-token Parallel, all-at-once
TTFT Fast (200-400ms) Slower (700ms+)
Throughput Bounded by sequential nature Scales with parallelism
Cost scaling Linear with output length Sub-linear potential
Sweet spot Interactive chat, reasoning Batch, pipelines, agents
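The contrast is easy to see in a toy loop. The sketch below is not Mercury's actual algorithm (Inception has not published its denoising schedule); it only illustrates why a fixed number of parallel refinement passes can beat one model call per token. The 8 refinement steps are an arbitrary assumption for illustration:

```python
def autoregressive_generate(n_tokens, step_fn):
    """Toy autoregressive loop: one model call per emitted token."""
    seq, calls = [], 0
    for _ in range(n_tokens):
        seq.append(step_fn(seq))  # next token depends on everything so far
        calls += 1
    return seq, calls

def diffusion_generate(n_tokens, n_refine_steps, refine_fn):
    """Toy diffusion loop: a fixed number of passes, each refining
    every position in parallel."""
    seq = ["<noise>"] * n_tokens
    for _ in range(n_refine_steps):
        seq = refine_fn(seq)  # all positions updated at once
    return seq, n_refine_steps

_, ar_calls = autoregressive_generate(256, lambda seq: "tok")
_, diff_calls = diffusion_generate(256, 8, lambda seq: ["tok"] * len(seq))
print(ar_calls, diff_calls)  # → 256 8
```

256 sequential model calls versus 8 parallel passes is the whole pitch; the open question is quality per pass, which the tests below probe.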

Getting Started: Two Lines of Change

Mercury 2 is fully OpenAI API-compatible. If you already use the OpenAI Python SDK, switching takes exactly two changes — the base URL and the API key:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["INCEPTION_API_KEY"],
    base_url="https://api.inceptionlabs.ai/v1",
)

That's it. Every client.chat.completions.create() call works the same as with OpenAI. No new SDK, no wrapper library, no config files. You can also use LiteLLM, AISuite, or LangChain's ChatOpenAI with a custom base_url.

Test 1: Can It Talk?

We started simple — ask it to explain itself:

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Explain diffusion language models in 2 sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)

Response:

Diffusion language models generate text by iteratively denoising a noisy token sequence, much like diffusion models for images, allowing many tokens to be produced in parallel rather than one-by-one. This parallel generation makes them several times faster and less than half as costly as traditional auto-regressive LLMs while also enabling fine-grained control over schema and multimodal integration.

75 tokens in 0.64 seconds. Clean, accurate, well-structured. No hallucinations. But 117 tok/s is a far cry from the advertised 1,196. On short outputs, network round-trip dominates — the model finishes generating before the response even reaches you.

Test 2: Pushing Throughput

To see real speed, you need to request longer outputs. We asked for a detailed Flask tutorial with max_tokens=1024:

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Write a detailed technical tutorial about building "
               "a REST API with Python Flask. Cover routing, error handling, "
               "database integration, authentication, and deployment."}],
    max_tokens=1024,
)

Results:

Metric Value
Completion tokens 866
Wall time 1.750s
Throughput 495 tok/s

866 tokens in under two seconds. The model hit the token limit and was still going — it had more to say. At 495 tok/s end-to-end from a consumer internet connection, this is already several times faster than what you'd get from GPT-4o or Claude Sonnet.
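For reproducibility, the end-to-end numbers come from wall-clocking the call and dividing the reported completion tokens by the elapsed time. `measure_throughput` is our own helper name, not part of any SDK; it works with any OpenAI-style client:

```python
import time

def measure_throughput(create_fn, **kwargs):
    """Time one completion call and compute client-side tokens/sec.
    `create_fn` is any OpenAI-style callable, e.g.
    client.chat.completions.create."""
    start = time.perf_counter()
    response = create_fn(**kwargs)
    wall = time.perf_counter() - start
    tokens = response.usage.completion_tokens
    return tokens, wall, tokens / wall

# Usage (requires a configured client):
# tokens, wall, speed = measure_throughput(
#     client.chat.completions.create,
#     model="mercury-2",
#     messages=[{"role": "user", "content": "Write a Flask tutorial."}],
#     max_tokens=1024,
# )
```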

Test 3: Streaming — Where the Speed Really Shows

Streaming reveals how diffusion models behave differently. With autoregressive models, tokens trickle in one by one — you see the response being "typed out." With Mercury 2, there's a longer pause, then tokens arrive in bursts:

stream = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Write a comprehensive guide to Python "
               "decorators with 5 examples."}],
    max_tokens=1024,
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Results:

Metric Value
Completion tokens 900
TTFT (time to first token) 741ms
Generation phase 1.614s
Generation speed (excl TTFT) 558 tok/s
End-to-end speed 382 tok/s

Here's the key insight: 558 tok/s during the generation phase. The 741ms time-to-first-token is higher than autoregressive models (which typically start streaming in 200-400ms), but that's because Mercury 2 does its "thinking" upfront — denoising all tokens in parallel — before emitting anything.

We received only 31 chunks for 900 tokens, meaning the API batches roughly 29 tokens per chunk. You don't see a character-by-character typewriter effect; you see paragraphs appearing in rapid bursts.
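To reproduce the TTFT and chunk-size numbers, walk the stream with a timer. `profile_stream` is our own helper name, not an SDK function; it consumes any OpenAI-style chat-completions stream:

```python
import time

def profile_stream(stream):
    """Consume a chat-completions stream; report TTFT, chunk count,
    and the full text, so generation speed can be computed with and
    without the TTFT wait."""
    start = time.perf_counter()
    ttft, chunks, parts = None, 0, []
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = time.perf_counter() - start  # first content chunk
            chunks += 1
            parts.append(chunk.choices[0].delta.content)
    total = time.perf_counter() - start
    return {"ttft": ttft, "chunks": chunks,
            "gen_time": total - (ttft or 0.0), "text": "".join(parts)}
```

Dividing completion tokens by `gen_time` gives the 558 tok/s style figure; dividing by the total elapsed time gives the end-to-end speed.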

Test 4: Tool Use

Function calling is table-stakes for agentic applications. We defined a weather tool and asked about Belgrade:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What's the weather in Belgrade?"}],
    tools=tools,
    max_tokens=200,
)

for tc in response.choices[0].message.tool_calls:
    print(f"{tc.function.name}({tc.function.arguments})")

Output:

get_weather({
  "location": "Belgrade",
  "unit": "celsius"
})

Correct function, correct arguments, and it even inferred celsius for a European city. Finished in 0.678s with finish_reason: tool_calls. This works exactly as you'd expect from the OpenAI API — no surprises, no adaptation needed.
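The test above stops at the tool call itself; to finish the round trip you echo the assistant's tool calls back into the history, attach one tool-role result per call, and call create() again for the final answer. A sketch of the message assembly, using the standard OpenAI chat format (`tool_round_trip` is our own helper name):

```python
import json

def tool_round_trip(messages, assistant_message, results):
    """Append the assistant's tool calls and their results so the next
    create() call can produce the final natural-language answer.
    `results` maps tool_call id -> Python object your tool returned."""
    messages.append({
        "role": "assistant",
        "content": assistant_message.content,
        "tool_calls": [
            {"id": tc.id, "type": "function",
             "function": {"name": tc.function.name,
                          "arguments": tc.function.arguments}}
            for tc in assistant_message.tool_calls
        ],
    })
    for tc in assistant_message.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": json.dumps(results[tc.id]),
        })
    return messages
```

After this, calling create() again with the extended `messages` (and the same `tools`) should yield the natural-language summary.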

Test 5: Structured Output

JSON mode is critical for production pipelines. We tested with response_format={"type": "json_object"}:

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{
        "role": "user",
        "content": 'List 3 programming languages with their year of creation. '
                   'Return as a JSON object with a "languages" key containing '
                   'an array of objects with "name" and "year" fields.',
    }],
    response_format={"type": "json_object"},
    max_tokens=300,
)

import json
parsed = json.loads(response.choices[0].message.content)
print(json.dumps(parsed, indent=2))

Output:

{
  "languages": [
    { "name": "C", "year": 1972 },
    { "name": "Python", "year": 1991 },
    { "name": "JavaScript", "year": 1995 }
  ]
}

Valid JSON, correct schema, accurate facts. Parsed without errors. For production use, you'd want to test with more complex schemas, but the basics are solid.
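JSON mode guarantees syntactic validity, not your schema, so production code should still validate the shape before using it. A minimal stdlib-only check for the schema we requested (a real pipeline might use jsonschema or pydantic instead):

```python
import json

def parse_languages(content):
    """Parse the model's JSON-mode output and verify the requested
    shape: {"languages": [{"name": str, "year": int}, ...]}."""
    data = json.loads(content)
    if not isinstance(data.get("languages"), list):
        raise ValueError("missing 'languages' array")
    for item in data["languages"]:
        if not (isinstance(item.get("name"), str)
                and isinstance(item.get("year"), int)):
            raise ValueError(f"bad entry: {item!r}")
    return data
```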

Test 6: Speed Consistency

We ran the same prompt three times to check for variance:

Run Tokens Time Speed
1 308 1.189s 259 tok/s
2 262 1.090s 240 tok/s
3 286 0.902s 317 tok/s
Average 272 tok/s
Peak 317 tok/s

Variance of 240–317 tok/s is acceptable. Differences come from network jitter, server load, and the model using different numbers of diffusion steps depending on output complexity.


The Speed Gap: Advertised vs. Measured

Measurement Speed Notes
Inception's benchmark 1,196 tok/s Server-side, no network
Our best (streaming, generation only) 558 tok/s Excludes TTFT
Our best (non-streaming, end-to-end) 495 tok/s Large output
Multi-run average 272 tok/s Medium output
Short output 117 tok/s Network dominates

We measured roughly half the advertised speed. That's not a knock on Mercury 2 — it's physics. Our tests ran from a consumer internet connection through the public API. The 1,196 tok/s figure is server-side throughput measured at the inference layer, before network overhead, TLS, HTTP framing, and Python SDK parsing eat into it.

To match their number, you'd need to benchmark from co-located infrastructure (same cloud region) or measure at the GPU layer. For what it's worth, 558 tok/s over the public internet is genuinely fast — most autoregressive models top out at 50-150 tok/s in comparable conditions.


How Does It Compare? Price & Speed

Speed only matters in context. Mercury 2 competes in the "fast and cheap" tier — models you'd use for high-volume pipelines, agents, and latency-sensitive applications, not frontier reasoning. Here's how it stacks up:

Pricing Comparison

Model Input $/M Output $/M Context Architecture
Mercury 2 $0.25 $1.00 128K Diffusion
DeepSeek V3 $0.28 $0.42 128K Autoregressive (MoE)
GPT-4.1 Nano $0.10 $0.40 1M Autoregressive
GPT-4o-mini $0.15 $0.60 128K Autoregressive
Gemini 2.0 Flash $0.10 $0.40 1M Autoregressive
Claude 3.5 Haiku $0.80 $4.00 200K Autoregressive
GPT-4o $2.50 $10.00 128K Autoregressive
Claude Sonnet 4.6 $3.00 $15.00 200K Autoregressive

On input pricing, Mercury 2 is mid-pack — GPT-4.1 Nano and Gemini 2.0 Flash are cheaper at $0.10/M. On output, it's $1.00/M — more expensive than DeepSeek ($0.42) and GPT-4.1 Nano ($0.40), but far cheaper than Claude Haiku ($4.00) or any mid-tier model.

The real cost story is output-heavy workloads. If you're generating long responses (agents, code generation, content pipelines), output pricing dominates. At $1.00/M output, Mercury 2 costs:

  • 2.5x more than GPT-4.1 Nano
  • 2.5x more than Gemini 2.0 Flash
  • 4x less than Claude 3.5 Haiku
  • 10x less than GPT-4o
  • 15x less than Claude Sonnet

Speed Comparison

Model Approx. Speed (tok/s) Notes
Mercury 2 495–558 (measured) Diffusion; 1,196 server-side
Gemini 2.0 Flash ~250 Google's speed tier
GPT-4.1 Nano ~150–200 OpenAI's fastest
DeepSeek V3 ~100–160 Varies by load
GPT-4o-mini ~100–130 OpenAI speed tier
Claude 3.5 Haiku ~80–100 Anthropic speed tier
GPT-4o ~60–90 Mid-tier
Claude Sonnet 4.6 ~70–80 Mid-tier

Speed figures are approximate client-side measurements and vary by network, region, and load. Mercury 2 figures are from our testing.

Even through the public internet, Mercury 2 is 2-3x faster than the next fastest competitor (Gemini Flash at ~250 tok/s) and 5-7x faster than mid-tier models like GPT-4o and Claude Sonnet. This is where the diffusion architecture genuinely shines — it's not marketing fluff.

Cost per Million Output Tokens at Speed

A useful way to think about it: what do you pay per million output tokens, and how fast do you get them?

Model Output $/M Speed (tok/s) Time for 1M tokens Cost per hour of output
Mercury 2 $1.00 550 ~30 min ~$2.00
GPT-4.1 Nano $0.40 175 ~95 min ~$0.25
DeepSeek V3 $0.42 130 ~128 min ~$0.20
Gemini 2.0 Flash $0.40 250 ~67 min ~$0.36
Claude 3.5 Haiku $4.00 90 ~185 min ~$1.30

Mercury 2 isn't the cheapest per token, but it delivers those tokens fastest. If your bottleneck is latency — how quickly you can complete an agentic loop, respond to a user, or process a document — Mercury 2 wins decisively. If your bottleneck is pure cost and you can tolerate slower speeds, DeepSeek V3 or GPT-4.1 Nano are cheaper.


Beyond Speed: Extended Testing

We ran a second test suite covering reasoning, multi-language, multi-turn conversation, agentic tool chains, needle-in-a-haystack retrieval, and edge cases. The results surfaced both strengths and a critical quirk.

The max_tokens Trap

The most important practical finding: Mercury 2 needs generous max_tokens values or it returns empty responses.

With autoregressive models, setting max_tokens=20 means "generate up to 20 tokens, stop when you're done." The model emits tokens one by one and stops early if it finishes. Mercury 2's diffusion architecture works differently — it appears to allocate the full output buffer upfront. If that buffer is too small, the model produces empty content with finish_reason=length and tokens=0:

# This fails silently — returns empty string
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=10,  # too low for diffusion model
)
print(response.choices[0].message.content)  # ""

# This works — give it room
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=150,  # generous headroom
)
print(response.choices[0].message.content)  # "4"

Rule of thumb: always set max_tokens to at least 100–200, even if you expect a short answer. The model will still stop early (finish_reason=stop) when it's done — you won't waste tokens. But if you set it too low, you get nothing. This is a significant difference from autoregressive models and will bite you in production if you're migrating existing code.
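A defensive wrapper makes the rule of thumb automatic: enforce a floor, and if the response still comes back empty with finish_reason == "length", double the budget and retry. This is our own sketch, not an official workaround from Inception:

```python
def create_with_headroom(create_fn, min_tokens=200, max_retries=2, **kwargs):
    """Guard against empty responses caused by a too-small max_tokens.
    Raises the budget to a floor, then doubles it and retries whenever
    the model returns no content with finish_reason == "length".
    `create_fn` is any OpenAI-style callable."""
    kwargs["max_tokens"] = max(kwargs.get("max_tokens", 0), min_tokens)
    response = create_fn(**kwargs)
    for _ in range(max_retries):
        choice = response.choices[0]
        if choice.message.content or choice.finish_reason != "length":
            return response
        kwargs["max_tokens"] *= 2  # empty + length: double and retry
        response = create_fn(**kwargs)
    return response
```

For example, `create_with_headroom(client.chat.completions.create, model="mercury-2", messages=..., max_tokens=10)` silently runs with a 200-token budget instead.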

Quality & Reasoning: 5/6

Test Result Details
Arithmetic (17×23+14−5) PASS Returned 400 correctly
Word problem (45−12+8) PASS Returned 41 correctly
Logic (invalid syllogism) FAIL Empty response (max_tokens issue)
Code generation (fibonacci) PASS Clean Python function, 182 chars
Instruction following (3 bullets) PASS Exactly 3 dash-prefixed bullets
Factual recall (capital of Australia) PASS Canberra

Reasoning is solid. The one failure was the logic test returning empty — likely another max_tokens casualty at 80 tokens, not a reasoning failure. Math, code, instructions, and factual recall all pass cleanly.

Multi-language: 3/3

Language Prompt Response
Serbian "Koji je glavni grad Srbije?" Beograd
German "Was ist die Hauptstadt von Deutschland?" Berlin
Japanese "日本の首都はどこですか?" 東京

Mercury 2 handles non-English prompts correctly, including Latin-script Serbian and CJK Japanese. Responses are accurate and concise.

Multi-turn Conversation: 3/3

We tested whether the model maintains context across turns:

messages = [
    {"role": "system", "content": "You are a helpful assistant. Be concise."},
    {"role": "user", "content": "My name is Marko and I live in Novi Sad."},
]
# ... assistant responds ...
messages.append({"role": "user", "content": "What is my name?"})
# → "Your name is Marko."
messages.append({"role": "user", "content": "Where do I live?"})
# → "You live in Novi Sad."

Both facts recalled correctly. We also tested persona consistency by assigning a pirate persona — Mercury 2 committed fully ("Arr, matey! Gather 'round the galley o' knowledge...") with 7 pirate-themed words in a single response.
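The snippet above elides the assistant replies; in practice you append each reply to the history so context accumulates across turns. A minimal turn helper (`chat_turn` is our own name):

```python
def chat_turn(create_fn, messages, user_text, max_tokens=200):
    """Append a user turn, fetch the assistant's reply, and record it
    so the next turn keeps full conversational context.
    `create_fn` is e.g. client.chat.completions.create."""
    messages.append({"role": "user", "content": user_text})
    reply = create_fn(model="mercury-2", messages=messages,
                      max_tokens=max_tokens).choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply
```

Note the generous default `max_tokens=200`, per the trap described above.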

Agentic Tool Chains: 4/4

This was the most impressive result. We defined three tools (search_product, get_price, add_to_cart) and asked Mercury 2 to find a blue t-shirt and add it to a cart:

Step 1: User asks "Find me a blue t-shirt and add it to my cart."
     → Model calls search_product(query="blue t-shirt")       ✓

Step 2: We return search results with SKU-1234
     → Model calls add_to_cart(product_id="SKU-1234")          ✓

Step 3: We confirm the cart addition
     → Model responds: "Your blue t-shirt has been added       ✓
        to your cart. Let me know if you'd like anything else."

Step 4: We return a tool error ("Service temporarily unavailable")
     → Model retries the tool call                             ✓

Four steps, four correct decisions. The model understood the task, chained tools in the right order, confirmed success in natural language, and recovered from an error by retrying. This validates Inception's pitch that Mercury 2 is built for agentic workloads.
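The four-step chain generalizes to a small driver loop: call the model, execute any requested tools, feed results back, and stop at the first plain-text reply. A sketch of the loop we used, with the create call injected so it works with any OpenAI-compatible client (`run_agent` and `tool_impls` are our own names):

```python
import json

def run_agent(create_fn, messages, tools, tool_impls, max_steps=8):
    """Minimal agent loop. `tool_impls` maps tool name -> Python
    callable accepting the tool's arguments as keywords."""
    for _ in range(max_steps):
        msg = create_fn(messages=messages, tools=tools).choices[0].message
        if not msg.tool_calls:
            return msg.content  # final natural-language answer
        messages.append({
            "role": "assistant", "content": msg.content,
            "tool_calls": [{"id": tc.id, "type": "function",
                            "function": {"name": tc.function.name,
                                         "arguments": tc.function.arguments}}
                           for tc in msg.tool_calls],
        })
        for tc in msg.tool_calls:
            try:
                result = tool_impls[tc.function.name](
                    **json.loads(tc.function.arguments))
            except Exception as exc:  # surface tool errors to the model
                result = {"error": str(exc)}
            messages.append({"role": "tool", "tool_call_id": tc.id,
                             "content": json.dumps(result)})
    raise RuntimeError("agent did not finish within max_steps")
```

Returning `{"error": ...}` instead of raising is what lets the model see a failure and retry, as it did in Step 4.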

Needle in a Haystack: 3/3

We hid the string MERCURY-FAST-7742 inside ~4,000 tokens of filler text at three positions:

Position Retrieved
Beginning MERCURY-FAST-7742
Middle MERCURY-FAST-7742
End MERCURY-FAST-7742

Perfect retrieval at all positions. The 128K context window handles information retrieval correctly — at least at the ~4K scale we tested.

What Didn't Work

Our concurrency test returned 0 tokens across all 20 parallel requests: the max_tokens=20 setting caused every request to fail silently. This needs retesting with proper headroom.
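A retest sketch with proper headroom, fanning requests out over a thread pool (our own helper; `create_fn` is e.g. client.chat.completions.create):

```python
from concurrent.futures import ThreadPoolExecutor

def run_concurrent(create_fn, prompts, max_tokens=300, workers=10):
    """Fire the prompts in parallel with generous max_tokens headroom;
    return each response's completion-token count so empty responses
    show up as zeros."""
    def one(prompt):
        r = create_fn(
            model="mercury-2",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        return r.usage.completion_tokens
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one, prompts))
```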

Temperature sampling also produced empty responses at max_tokens=10. We couldn't evaluate whether the diffusion model's temperature parameter behaves differently from autoregressive models — an open question for future testing.

System prompt adherence with very constrained instructions ("respond in exactly 3 words") returned empty at max_tokens=20. With adequate headroom, Mercury 2 follows system prompts well (the pirate test proves this), but the tight token budget prevented evaluation of precise constraint-following.

Extended Test Summary

Suite Score Verdict
Reasoning 5/6 Strong — math, code, facts, instructions
Multi-language 3/3 Serbian, German, Japanese all correct
Multi-turn 3/3 Memory and persona consistency
Agentic Loops 4/4 Multi-step tool chains + error recovery
Needle-in-Haystack 3/3 Perfect retrieval at all positions
Edge Cases 3/5 JSON and stop sequences work; system prompt tests need more headroom
Concurrency Inconclusive (max_tokens issue)
Sampling Inconclusive (max_tokens issue)
Total 21/24

The Bottom Line

Mercury 2 is real and it works. The OpenAI compatibility is seamless — you can swap it into an existing codebase in under a minute. Tool use, streaming, JSON mode, multi-language, multi-turn, and agentic tool chains all function correctly. It scored 21/24 on our extended test suite, with the three failures all traceable to a single cause: the diffusion model's sensitivity to low max_tokens values.

That max_tokens quirk is the one thing you must know before deploying. Set it generously (200+) even for short expected outputs, or you'll get silent empty responses. It's not a bug per se — it's how diffusion generation works — but it will catch anyone migrating from autoregressive models.

The speed advantage is genuine, though tempered by network reality. You won't see 1,196 tok/s from your laptop, but 400-550 tok/s is still 2-3x faster than the next fastest alternative. The agentic capabilities are particularly strong — multi-step tool chains with error recovery worked flawlessly, validating Inception's core pitch.

It's not the cheapest model per token (GPT-4.1 Nano and DeepSeek V3 undercut it on output pricing), and it's not the smartest (frontier models like Claude Sonnet or GPT-4o have deeper reasoning). But in the speed-to-cost ratio for production workloads, Mercury 2 occupies a unique position — and as the first commercial diffusion LLM, it represents a genuine architectural bet that the rest of the industry is watching.

Specs at a Glance:

Model mercury-2
Architecture Diffusion LLM (dLLM)
Context window 128K tokens
Max completion 16,384 tokens
Input pricing $0.25/M tokens
Output pricing $1.00/M tokens
API compatibility OpenAI-compatible
Measured throughput 495–558 tok/s (client-side)