
Mercury 2: Hands-On With the World's Fastest Reasoning LLM

ai.rs Feb 28, 2026

Inception Labs launched Mercury 2 on February 24, claiming it's the fastest reasoning LLM available — a diffusion language model that generates text at 1,196 tokens per second, 5-10x faster than speed-optimized models like GPT-4.1 Nano and Claude 3.5 Haiku. At $0.25 per million input tokens, it's also among the cheapest.

We put those claims to the test.


The Pitch: Diffusion, Not Autoregressive

Every major LLM today — GPT, Claude, Llama, Gemini — is autoregressive: it generates tokens one at a time, left to right, each depending on all previous tokens. Mercury 2 takes a fundamentally different approach. Like Stable Diffusion for images, it starts with noise and iteratively refines all tokens in parallel.

The result, in theory: massively parallel generation that breaks the sequential bottleneck.

Autoregressive (GPT, Claude) Diffusion (Mercury 2)
Generation Sequential, token-by-token Parallel, all-at-once
TTFT Fast (200-400ms) Slower (700ms+)
Throughput Bounded by sequential nature Scales with parallelism
Cost scaling Linear with output length Sub-linear potential
Sweet spot Interactive chat, reasoning Batch, pipelines, agents

Getting Started: Two Lines of Change

Mercury 2 is fully OpenAI API-compatible. If you already use the OpenAI Python SDK, switching takes exactly two changes — the base URL and the API key:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["INCEPTION_API_KEY"],
    base_url="https://api.inceptionlabs.ai/v1",
)

That's it. Every client.chat.completions.create() call works the same as with OpenAI. No new SDK, no wrapper library, no config files. You can also use LiteLLM, AISuite, or LangChain's ChatOpenAI with a custom base_url.
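Because the endpoint speaks the standard OpenAI wire format, you aren't even tied to the SDK. As a sketch, a raw HTTP request can be assembled by hand (building the request makes no network call; sending it requires a real key):

```python
import json

# Sketch: assemble a raw chat-completions request for Mercury 2's
# OpenAI-compatible endpoint. Pass the result to any HTTP client to send it.
def build_chat_request(api_key, model, messages, **params):
    return {
        "url": "https://api.inceptionlabs.ai/v1/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"model": model, "messages": messages, **params}),
    }
```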

Test 1: Can It Talk?

We started simple — ask it to explain itself:

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Explain diffusion language models in 2 sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)

Response:

Diffusion language models generate text by iteratively denoising a noisy token sequence, much like diffusion models for images, allowing many tokens to be produced in parallel rather than one-by-one. This parallel generation makes them several times faster and less than half as costly as traditional auto-regressive LLMs while also enabling fine-grained control over schema and multimodal integration.

75 tokens in 0.64 seconds. Clean, accurate, well-structured. No hallucinations. But 117 tok/s is a far cry from the advertised 1,196. On short outputs, network round-trip dominates — the model finishes generating before the response even reaches you.
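The arithmetic is easy to sanity-check. Modeling measured speed as server-side generation plus a fixed network overhead (the overhead figure below is our assumption, not a measured constant) reproduces the effect:

```python
# Effective client-side throughput: tokens divided by wall time.
def effective_tok_per_s(tokens, wall_s):
    return tokens / wall_s

# Model measured speed as server generation time plus fixed network overhead.
def with_overhead(tokens, server_tok_per_s, overhead_s):
    return tokens / (tokens / server_tok_per_s + overhead_s)
```

At 1,196 tok/s server-side, 75 tokens take about 63ms to generate; roughly 0.58s of round-trip overhead drags the observed rate down to ~117 tok/s.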

Test 2: Pushing Throughput

To see real speed, you need to request longer outputs. We asked for a detailed Flask tutorial with max_tokens=1024:

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Write a detailed technical tutorial about building "
               "a REST API with Python Flask. Cover routing, error handling, "
               "database integration, authentication, and deployment."}],
    max_tokens=1024,
)

Metric Value
Completion tokens 866
Wall time 1.750s
Throughput 495 tok/s

866 tokens in under two seconds. The model hit the token limit and was still going — it had more to say. At 495 tok/s end-to-end from a consumer internet connection, this is already several times faster than what you'd get from GPT-4o or Claude Sonnet.

Test 3: Streaming — Where the Speed Really Shows

Streaming reveals how diffusion models behave differently. With autoregressive models, tokens trickle in one by one — you see the response being "typed out." With Mercury 2, there's a longer pause, then tokens arrive in bursts:

stream = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Write a comprehensive guide to Python "
               "decorators with 5 examples."}],
    max_tokens=1024,
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Metric Value
Completion tokens 900
TTFT (time to first token) 741ms
Generation phase 1.614s
Generation speed (excl TTFT) 558 tok/s
End-to-end speed 382 tok/s

Here's the key insight: 558 tok/s during the generation phase. The 741ms time-to-first-token is higher than autoregressive models (which typically start streaming in 200-400ms), but that's because Mercury 2 does its "thinking" upfront — denoising all tokens in parallel — before emitting anything.

We received only 31 chunks for 900 tokens, meaning the API batches roughly 29 tokens per chunk. You don't see a character-by-character typewriter effect; you see paragraphs appearing in rapid bursts.
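The metrics above fall out of the chunk arrival times. A small helper (shown with fabricated timings that mirror our run, not raw API data; assumes at least two chunks) separates TTFT from generation-phase speed:

```python
# chunks: list of (arrival_time_s, tokens_in_chunk); t_start: request time.
def stream_metrics(chunks, t_start):
    ttft = chunks[0][0] - t_start              # time to first token
    total = sum(n for _, n in chunks)
    gen_time = chunks[-1][0] - chunks[0][0]    # generation phase only
    return {
        "ttft_s": ttft,
        "gen_tok_per_s": total / gen_time,
        "e2e_tok_per_s": total / (chunks[-1][0] - t_start),
    }
```

Feeding it two synthetic chunks at t=0.741s and t=2.355s carrying 450 tokens each reproduces the table: TTFT 741ms, ~558 tok/s generation, ~382 tok/s end-to-end.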

Test 4: Tool Use

Function calling is table-stakes for agentic applications. We defined a weather tool and asked about Belgrade:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What's the weather in Belgrade?"}],
    tools=tools,
    max_tokens=200,
)

for tc in response.choices[0].message.tool_calls:
    print(f"{tc.function.name}({tc.function.arguments})")

Output:

get_weather({
  "location": "Belgrade",
  "unit": "celsius"
})

Correct function, correct arguments, and it even inferred celsius for a European city. Finished in 0.678s with finish_reason: tool_calls. This works exactly as you'd expect from the OpenAI API — no surprises, no adaptation needed.

Test 5: Structured Output

JSON mode is critical for production pipelines. We tested with response_format={"type": "json_object"}:

import json

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{
        "role": "user",
        "content": 'List 3 programming languages with their year of creation. '
                   'Return as a JSON object with a "languages" key containing '
                   'an array of objects with "name" and "year" fields.',
    }],
    response_format={"type": "json_object"},
    max_tokens=300,
)

parsed = json.loads(response.choices[0].message.content)
print(json.dumps(parsed, indent=2))

Output:

{
  "languages": [
    { "name": "C", "year": 1972 },
    { "name": "Python", "year": 1991 },
    { "name": "JavaScript", "year": 1995 }
  ]
}

Valid JSON, correct schema, accurate facts. Parsed without errors. For production use, you'd want to test with more complex schemas, but the basics are solid.
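Before trusting model-emitted JSON in a pipeline, we'd still add a shape check. A minimal pure-Python validator for the schema above (a sketch; a real pipeline would likely use a schema library):

```python
def validate_languages(payload):
    # Raise ValueError if the payload deviates from the expected shape.
    langs = payload.get("languages")
    if not isinstance(langs, list) or not langs:
        raise ValueError("expected a non-empty 'languages' array")
    for item in langs:
        if not isinstance(item.get("name"), str):
            raise ValueError("each entry needs a string 'name'")
        if not isinstance(item.get("year"), int):
            raise ValueError("each entry needs an integer 'year'")
    return payload
```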

Test 6: Speed Consistency

We ran the same prompt three times to check for variance:

Run Tokens Time Speed
1 308 1.189s 259 tok/s
2 262 1.090s 240 tok/s
3 286 0.902s 317 tok/s
Average 272 tok/s
Peak 317 tok/s

Variance of 240–317 tok/s is acceptable. Differences come from network jitter, server load, and the model using different numbers of diffusion steps depending on output complexity.


The Speed Gap: Advertised vs. Measured

Measurement Speed Notes
Inception's benchmark 1,196 tok/s Server-side, no network
Our best (streaming, generation only) 558 tok/s Excludes TTFT
Our best (non-streaming, end-to-end) 495 tok/s Large output
Multi-run average 272 tok/s Medium output
Short output 117 tok/s Network dominates

We measured roughly half the advertised speed. That's not a knock on Mercury 2 — it's physics. Our tests ran from a consumer internet connection through the public API. The 1,196 tok/s figure is server-side throughput measured at the inference layer, before network overhead, TLS, HTTP framing, and Python SDK parsing eat into it.

To match their number, you'd need to benchmark from co-located infrastructure (same cloud region) or measure at the GPU layer. For what it's worth, 558 tok/s over the public internet is genuinely fast — most autoregressive models top out at 50-150 tok/s in comparable conditions.


How Does It Compare? Price & Speed

Speed only matters in context. Mercury 2 competes in the "fast and cheap" tier — models you'd use for high-volume pipelines, agents, and latency-sensitive applications, not frontier reasoning. Here's how it stacks up:

Pricing Comparison

Model Input $/M Output $/M Context Architecture
Mercury 2 $0.25 $1.00 128K Diffusion
DeepSeek V3 $0.28 $0.42 128K Autoregressive (MoE)
GPT-4.1 Nano $0.10 $0.40 1M Autoregressive
GPT-4o-mini $0.15 $0.60 128K Autoregressive
Gemini 2.0 Flash $0.10 $0.40 1M Autoregressive
Claude 3.5 Haiku $0.80 $4.00 200K Autoregressive
GPT-4o $2.50 $10.00 128K Autoregressive
Claude Sonnet 4.6 $3.00 $15.00 200K Autoregressive

On input pricing, Mercury 2 is mid-pack — GPT-4.1 Nano and Gemini 2.0 Flash are cheaper at $0.10/M. On output, it's $1.00/M — more expensive than DeepSeek ($0.42) and GPT-4.1 Nano ($0.40), but far cheaper than Claude Haiku ($4.00) or any mid-tier model.

The real cost story is output-heavy workloads. If you're generating long responses (agents, code generation, content pipelines), output pricing dominates. At $1.00/M output, Mercury 2 costs:

  • 2.5x more than GPT-4.1 Nano
  • 2.5x more than Gemini 2.0 Flash
  • 4x less than Claude 3.5 Haiku
  • 10x less than GPT-4o
  • 15x less than Claude Sonnet

Speed Comparison

Model Approx. Speed (tok/s) Notes
Mercury 2 495–558 (measured) Diffusion; 1,196 server-side
Gemini 2.0 Flash ~250 Google's speed tier
DeepSeek V3 ~100–160 Varies by load
GPT-4o-mini ~100–130 OpenAI speed tier
GPT-4.1 Nano ~150–200 OpenAI's fastest
Claude 3.5 Haiku ~80–100 Anthropic speed tier
GPT-4o ~60–90 Mid-tier
Claude Sonnet 4.6 ~70–80 Mid-tier

Speed figures are approximate client-side measurements and vary by network, region, and load. Mercury 2 figures are from our testing.

Even through the public internet, Mercury 2 is 2-3x faster than the next fastest competitor (Gemini Flash at ~250 tok/s) and 5-7x faster than mid-tier models like GPT-4o and Claude Sonnet. This is where the diffusion architecture genuinely shines — it's not marketing fluff.

Cost per Million Output Tokens at Speed

A useful way to think about it: what do you pay per million output tokens, and how fast do you get them?

Model Output $/M Speed (tok/s) Time for 1M tokens Cost per hour of output
Mercury 2 $1.00 550 ~30 min ~$2.00
GPT-4.1 Nano $0.40 175 ~95 min ~$0.25
DeepSeek V3 $0.42 130 ~128 min ~$0.20
Gemini 2.0 Flash $0.40 250 ~67 min ~$0.36
Claude 3.5 Haiku $4.00 90 ~185 min ~$1.30
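The last two columns are straightforward to derive (a sketch reproducing the table's arithmetic; speeds are our approximate client-side figures):

```python
# Time to produce one million tokens at a given throughput.
def minutes_for_1m_tokens(tok_per_s):
    return 1_000_000 / tok_per_s / 60

# Tokens generated in an hour of continuous output, priced per million.
def cost_per_hour_of_output(output_price_per_m, tok_per_s):
    return output_price_per_m * tok_per_s * 3600 / 1_000_000
```

At 550 tok/s and $1.00/M output, that works out to ~30 minutes per million tokens and ~$1.98 per hour of continuous generation.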

Mercury 2 isn't the cheapest per token, but it delivers those tokens fastest. If your bottleneck is latency — how quickly you can complete an agentic loop, respond to a user, or process a document — Mercury 2 wins decisively. If your bottleneck is pure cost and you can tolerate slower speeds, DeepSeek V3 or GPT-4.1 Nano are cheaper.


Beyond Speed: Extended Testing

We ran a second test suite covering reasoning, multi-language, multi-turn conversation, agentic tool chains, needle-in-a-haystack retrieval, and edge cases. The results surfaced both strengths and a critical quirk.

The max_tokens Trap

The most important practical finding: Mercury 2 needs generous max_tokens values or it returns empty responses.

With autoregressive models, setting max_tokens=20 means "generate up to 20 tokens, stop when you're done." The model emits tokens one by one and stops early if it finishes. Mercury 2's diffusion architecture works differently — it appears to allocate the full output buffer upfront. If that buffer is too small, the model produces empty content with finish_reason=length and tokens=0:

# This fails silently — returns empty string
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=10,  # too low for diffusion model
)
print(response.choices[0].message.content)  # ""

# This works — give it room
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=150,  # generous headroom
)
print(response.choices[0].message.content)  # "4"

Rule of thumb: always set max_tokens to at least 150–200, even if you expect a short answer. The model will still stop early (finish_reason=stop) when it's done — you won't waste tokens. But if you set it too low, you get nothing. This is a significant difference from autoregressive models and will bite you in production if you're migrating existing code.
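When migrating existing code, a defensive wrapper spares you from hunting down every call site. The floor value below is our rule of thumb, not a documented minimum:

```python
MERCURY_MIN_MAX_TOKENS = 150  # our rule of thumb, not an official limit

def mercury_params(**params):
    # Clamp max_tokens so the diffusion model always has output headroom.
    requested = params.get("max_tokens", MERCURY_MIN_MAX_TOKENS)
    params["max_tokens"] = max(requested, MERCURY_MIN_MAX_TOKENS)
    return params
```

Calling `client.chat.completions.create(model="mercury-2", messages=messages, **mercury_params(max_tokens=10))` then sends max_tokens=150 instead of 10, while larger values pass through untouched.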

The Proof: 10/25 → 25/25

Our first run scored 10 out of 25 — a result that would make Mercury 2 look broken. Our second run, with only max_tokens increased, scored 25/25. Nothing else changed — same prompts, same model, same API. Here's the full breakdown:

Suite Initial Final What changed
Reasoning 5/6 6/6 Logic: 80→150 tokens, instruction: 120→250
Multi-language 0/3 3/3 30→200 tokens
Multi-turn 0/3 3/3 30–60→200 tokens
Agentic 3/4 4/4 Fixed step 3 logic (model skipped get_price)
Needle-in-Haystack 0/3 3/3 40→200 tokens
Concurrency 0/20 20/20 20→150 tokens
Sampling 0/1 1/1 10→150 tokens
Edge Cases 2/5 5/5 System prompt 20→150, JSON 80→200, long sys 30→150
Total 10/25 25/25

Every single failure traced back to the same root cause: max_tokens too low for the diffusion architecture. No actual quality or capability issues were found. If you're migrating from GPT or Claude, your existing max_tokens values are almost certainly too low for Mercury 2.

Quality & Reasoning: 6/6

Test Result Details
Arithmetic (17×23+14−5) PASS Returned 400 correctly
Word problem (45−12+8) PASS Returned 41 correctly
Logic (invalid syllogism) PASS Correctly answered "No" with valid reasoning
Code generation (fibonacci) PASS Clean Python function, 107 chars
Instruction following (3 bullets) PASS Exactly 3 dash-prefixed bullets
Factual recall (capital of Australia) PASS Canberra

Perfect score. Math, logic, code generation, instruction following, and factual recall all pass cleanly.

Multi-language: 3/3

Language Prompt Response
Serbian "Koji je glavni grad Srbije?" Beograd
German "Was ist die Hauptstadt von Deutschland?" Berlin
Japanese "日本の首都はどこですか?" 東京

Mercury 2 handles non-English prompts correctly — including Cyrillic-adjacent and CJK languages. Responses are accurate and concise.

Multi-turn Conversation: 3/3

We tested whether the model maintains context across turns:

messages = [
    {"role": "system", "content": "You are a helpful assistant. Be concise."},
    {"role": "user", "content": "My name is Marko and I live in Novi Sad."},
]
# ... assistant responds ...
messages.append({"role": "user", "content": "What is my name?"})
# → "Your name is Marko."
messages.append({"role": "user", "content": "Where do I live?"})
# → "You live in Novi Sad."

Both facts recalled correctly. We also tested persona consistency by assigning a pirate persona — Mercury 2 committed fully ("Arr, matey! Gather 'round the galley o' knowledge...") with 7 pirate-themed words in a single response.

Agentic Tool Chains: 4/4

This was the most impressive result. We defined three tools (search_product, get_price, add_to_cart) and asked Mercury 2 to find a blue t-shirt and add it to a cart:

Step 1: User asks "Find me a blue t-shirt and add it to my cart."
     → Model calls search_product(query="blue t-shirt")       ✓

Step 2: We return search results with SKU-1234
     → Model calls add_to_cart(product_id="SKU-1234")          ✓

Step 3: We confirm the cart addition
     → Model responds: "Your blue t-shirt has been added       ✓
        to your cart. Let me know if you'd like anything else."

Step 4: We return a tool error ("Service temporarily unavailable")
     → Model retries the tool call                             ✓

Four steps, four correct decisions. The model understood the task, chained tools in the right order, confirmed success in natural language, and recovered from an error by retrying. This validates Inception's pitch that Mercury 2 is built for agentic workloads.
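The harness behind this test is a generic dispatch loop. Here's a sketch — the `call_model` and `tool_impls` injection points are our own design, not an SDK feature; `call_model` is assumed to return an OpenAI-style message dict:

```python
import json

def run_tool_loop(call_model, tool_impls, messages, max_steps=8):
    """Call the model, execute any requested tools, feed results back,
    and stop once the model answers in plain text."""
    for _ in range(max_steps):
        msg = call_model(messages)
        messages.append(msg)
        tool_calls = msg.get("tool_calls")
        if not tool_calls:
            return msg["content"]          # final natural-language answer
        for tc in tool_calls:
            fn = tc["function"]["name"]
            args = json.loads(tc["function"]["arguments"])
            result = tool_impls[fn](**args)  # run the real tool
            messages.append({
                "role": "tool",
                "tool_call_id": tc["id"],
                "content": json.dumps(result),
            })
    raise RuntimeError("tool loop did not converge")
```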

Needle in a Haystack: 3/3

We hid the string MERCURY-FAST-7742 inside ~4,000 tokens of filler text at three positions:

Position Found?
Beginning MERCURY-FAST-7742
Middle MERCURY-FAST-7742
End MERCURY-FAST-7742

Perfect retrieval at all positions. The 128K context window handles information retrieval correctly — at least at the ~4K scale we tested.
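The haystack itself is trivial to construct. A sketch of how we buried the needle (the filler text and positions are parameters we chose; nothing here is API-specific):

```python
def build_haystack(needle, filler_sentence, n_sentences, position):
    """position in [0, 1]: 0.0 = beginning, 0.5 = middle, 1.0 = end."""
    sentences = [filler_sentence] * n_sentences
    idx = round(position * n_sentences)
    sentences.insert(idx, needle)
    return " ".join(sentences)
```

The prompt then asks the model to return the hidden code verbatim.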

Concurrency: 20/20

We fired parallel requests to test API behavior under load:

Parallel Requests Success Wall Time Total Tokens Avg Latency
5 5/5 0.78s 20 0.65s
15 15/15 0.88s 60 0.61s

Every request succeeded. Wall time barely increased from 5 to 15 parallel requests (0.78s → 0.88s), and average latency stayed consistent at ~0.6s. The API handles concurrency well — no throttling, no degradation at this scale.
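The load test is a thin ThreadPoolExecutor wrapper (a sketch; `make_request` is an injection point for the real API call):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fire_parallel(make_request, n):
    # Fire n requests concurrently; return results and total wall time.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(make_request, range(n)))
    wall = time.perf_counter() - start
    return results, wall
```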

Temperature & Sampling

Diffusion models sample differently from autoregressive models. We tested whether Mercury 2's temperature parameter behaves as expected:

temp=0.0: ['turquoise', 'turquoise', 'turquoise', 'turquoise'] — 1 unique
temp=0.5: ['turquoise', 'turquoise', 'turquoise', 'turquoise'] — 1 unique
temp=1.0: ['turquoise', 'cerulean', 'indigo', 'turquoise']     — 3 unique
temp=1.5: ['turquoise', 'turquoise', 'cyan', 'turquoise']      — 2 unique

Temperature works, but with a twist: diversity only kicks in above 0.5. At temp=0.0 and 0.5, responses are identical — the diffusion denoising process converges to the same output. At temp=1.0, we see real variety (turquoise, cerulean, indigo). Determinism at temp=0 is confirmed: ['4', '4', '4'] across three runs.

This is meaningfully different from autoregressive models, where temp=0.5 already produces some variation.
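Counting diversity takes only a few lines (a sketch; `generate` is an injectable callable, so the harness itself is model-agnostic):

```python
def unique_answers(generate, prompt, temperature, n=4):
    # Sample the same prompt n times and count distinct responses.
    answers = [generate(prompt, temperature) for _ in range(n)]
    return len(set(answers)), answers
```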

Edge Cases: 5/5

Test Result Details
Minimal prompt ("Hi") PASS "Hello! How can I assist you today?"
System prompt (exactly 3 words) PASS "I'm doing well." — exactly 3 words
Stop sequence PASS Correctly stopped before "10"
Nested JSON PASS Valid JSON with nested objects and arrays
Long system prompt (50 rules) PASS Returned "acknowledged"

All edge cases pass with adequate max_tokens headroom. System prompt adherence, stop sequences, and complex JSON structures all work correctly.

Extended Test Summary

Suite Score Verdict
Reasoning 6/6 Math, logic, code, facts, instructions
Multi-language 3/3 Serbian, German, Japanese all correct
Multi-turn 3/3 Memory and persona consistency
Agentic Loops 4/4 Multi-step tool chains + error recovery
Needle-in-Haystack 3/3 Perfect retrieval at all positions
Edge Cases 5/5 System prompts, stop sequences, nested JSON
Concurrency 20/20 No degradation at 15 parallel requests
Sampling 1/1 Deterministic at temp=0, diversity above 0.5
Total 25/25

The Bottom Line

Mercury 2 scored 25/25 on our extended test suite — every capability we tested works correctly. Reasoning, multi-language, multi-turn conversation, agentic tool chains, needle-in-a-haystack retrieval, concurrency, temperature sampling, and edge cases all pass. The OpenAI compatibility is seamless — you can swap it into an existing codebase in under a minute.

The one thing you must know before deploying: set max_tokens generously (150+), even for short expected outputs. The diffusion architecture needs output headroom or it silently returns empty responses. This is the single biggest gotcha when migrating from autoregressive models. The model still stops early when it's done — you won't waste tokens — but too small a buffer produces nothing.

The speed advantage is genuine, though tempered by network reality. You won't see 1,196 tok/s from your laptop, but 400-550 tok/s is still 2-3x faster than the next fastest alternative. The agentic capabilities are particularly strong — multi-step tool chains with error recovery worked flawlessly, validating Inception's core pitch. Temperature sampling works but behaves differently: diversity only kicks in above 0.5, unlike autoregressive models where any non-zero temperature introduces variation.

It's not the cheapest model per token (GPT-4.1 Nano and DeepSeek V3 undercut it on output pricing), and it's not the smartest (frontier models like Claude Sonnet or GPT-4o have deeper reasoning). But in the speed-to-cost ratio for production workloads, Mercury 2 occupies a unique position — and as the first commercial diffusion LLM, it represents a genuine architectural bet that the rest of the industry is watching.

Specs at a Glance:

Model mercury-2
Architecture Diffusion LLM (dLLM)
Context window 128K tokens
Max completion 16,384 tokens
Input pricing $0.25/M tokens
Output pricing $1.00/M tokens
API compatibility OpenAI-compatible
Measured throughput 495–558 tok/s (client-side)