Inception Labs launched Mercury 2 on February 24, claiming it's the fastest reasoning LLM available — a diffusion language model that generates text at 1,196 tokens per second, 5-10x faster than speed-optimized models like GPT-4.1 Nano and Claude 3.5 Haiku. At $0.25 per million input tokens, it's also among the cheapest.
We put those claims to the test.
The Pitch: Diffusion, Not Autoregressive
Every major LLM today — GPT, Claude, Llama, Gemini — is autoregressive: it generates tokens one at a time, left to right, each depending on all previous tokens. Mercury 2 takes a fundamentally different approach. Like Stable Diffusion for images, it starts with noise and iteratively refines all tokens in parallel.
The result, in theory: massively parallel generation that breaks the sequential bottleneck.
| | Autoregressive (GPT, Claude) | Diffusion (Mercury 2) |
|---|---|---|
| Generation | Sequential, token-by-token | Parallel, all-at-once |
| Time to first token (TTFT) | Fast (200-400ms) | Slower (700ms+) |
| Throughput | Bounded by sequential nature | Scales with parallelism |
| Cost scaling | Linear with output length | Sub-linear potential |
| Sweet spot | Interactive chat, reasoning | Batch, pipelines, agents |
Getting Started: Two Lines of Change
Mercury 2 is fully OpenAI API-compatible. If you already use the OpenAI Python SDK, switching takes exactly two changes — the base URL and the API key:
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["INCEPTION_API_KEY"],
    base_url="https://api.inceptionlabs.ai/v1",
)
```
That's it. Every client.chat.completions.create() call works the same as with OpenAI. No new SDK, no wrapper library, no config files. You can also use LiteLLM, AISuite, or LangChain's ChatOpenAI with a custom base_url.
Test 1: Can It Talk?
We started simple — ask it to explain itself:
```python
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Explain diffusion language models in 2 sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```
Response:
Diffusion language models generate text by iteratively denoising a noisy token sequence, much like diffusion models for images, allowing many tokens to be produced in parallel rather than one-by-one. This parallel generation makes them several times faster and less than half as costly as traditional auto-regressive LLMs while also enabling fine-grained control over schema and multimodal integration.
75 tokens in 0.64 seconds. Clean, accurate, well-structured. No hallucinations. But 117 tok/s is a far cry from the advertised 1,196. On short outputs, network round-trip dominates — the model finishes generating before the response even reaches you.
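Every client-side speed figure in this piece is computed the same way: completion tokens divided by wall-clock time. A trivial helper (our own, not part of any SDK) makes the arithmetic explicit:

```python
def tokens_per_second(completion_tokens: int, wall_seconds: float) -> float:
    """Client-side throughput: completion tokens over wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall time must be positive")
    return completion_tokens / wall_seconds

# The short-output run above: 75 tokens in 0.64 s
print(round(tokens_per_second(75, 0.64)))  # → 117
```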
Test 2: Pushing Throughput
To see real speed, you need to request longer outputs. We asked for a detailed Flask tutorial with max_tokens=1024:
```python
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Write a detailed technical tutorial about building "
                                          "a REST API with Python Flask. Cover routing, error handling, "
                                          "database integration, authentication, and deployment."}],
    max_tokens=1024,
)
```
| Metric | Value |
|---|---|
| Completion tokens | 866 |
| Wall time | 1.750s |
| Throughput | 495 tok/s |
866 tokens in under two seconds. The model hit the token limit and was still going — it had more to say. At 495 tok/s end-to-end from a consumer internet connection, this is already several times faster than what you'd get from GPT-4o or Claude Sonnet.
Test 3: Streaming — Where the Speed Really Shows
Streaming reveals how diffusion models behave differently. With autoregressive models, tokens trickle in one by one — you see the response being "typed out." With Mercury 2, there's a longer pause, then tokens arrive in bursts:
```python
stream = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Write a comprehensive guide to Python "
                                          "decorators with 5 examples."}],
    max_tokens=1024,
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
| Metric | Value |
|---|---|
| Completion tokens | 900 |
| TTFT (time to first token) | 741ms |
| Generation phase | 1.614s |
| Generation speed (excl TTFT) | 558 tok/s |
| End-to-end speed | 382 tok/s |
Here's the key insight: 558 tok/s during the generation phase. The 741ms time-to-first-token is higher than autoregressive models (which typically start streaming in 200-400ms), but that's because Mercury 2 does its "thinking" upfront — denoising all tokens in parallel — before emitting anything.
We received only 31 chunks for 900 tokens, meaning the API batches roughly 29 tokens per chunk. You don't see a character-by-character typewriter effect; you see paragraphs appearing in rapid bursts.
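These streaming metrics fall out of simple timestamp arithmetic. If you record `time.monotonic()` at request start and at each chunk arrival, a helper like this (our own sketch, assuming at least two chunks) reproduces TTFT, generation-phase speed, and the batching factor:

```python
def stream_stats(t_start, chunk_times, total_tokens):
    """Derive streaming metrics from chunk arrival timestamps.

    chunk_times: monotonic timestamps, one per received chunk (>= 2 entries).
    """
    ttft = chunk_times[0] - t_start
    gen_phase = chunk_times[-1] - chunk_times[0]
    return {
        "ttft_s": ttft,
        "gen_tok_per_s": total_tokens / gen_phase,
        "end_to_end_tok_per_s": total_tokens / (chunk_times[-1] - t_start),
        "avg_tokens_per_chunk": total_tokens / len(chunk_times),
    }
```

Feeding in our run (31 chunks, first at 741 ms, last 1.614 s later, 900 tokens) recovers the table's 558 tok/s generation speed and ~29 tokens per chunk.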
Test 4: Tool Use
Function calling is table-stakes for agentic applications. We defined a weather tool and asked about Belgrade:
```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What's the weather in Belgrade?"}],
    tools=tools,
    max_tokens=200,
)

for tc in response.choices[0].message.tool_calls:
    print(f"{tc.function.name}({tc.function.arguments})")
```
Output:
```
get_weather({
  "location": "Belgrade",
  "unit": "celsius"
})
```
Correct function, correct arguments, and it even inferred celsius for a European city. Finished in 0.678s with finish_reason: tool_calls. This works exactly as you'd expect from the OpenAI API — no surprises, no adaptation needed.
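The model only proposes the call; your application still has to execute it. A minimal dispatcher — our own pattern, with a stand-in `get_weather` body where a real app would hit a weather API — looks like this:

```python
import json

def get_weather(location: str, unit: str = "celsius") -> str:
    # Stand-in implementation; a real app would call a weather service here.
    return f"22°{unit[0].upper()} and sunny in {location}"

# Map tool names the model may request to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}

def dispatch(tool_call) -> str:
    """Route an OpenAI-style tool call to the matching local function."""
    fn = TOOL_REGISTRY[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)
    return fn(**args)
```

You would then append the dispatcher's return value to `messages` as a `"role": "tool"` entry and call the model again.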
Test 5: Structured Output
JSON mode is critical for production pipelines. We tested with response_format={"type": "json_object"}:
```python
import json

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{
        "role": "user",
        "content": 'List 3 programming languages with their year of creation. '
                   'Return as a JSON object with a "languages" key containing '
                   'an array of objects with "name" and "year" fields.',
    }],
    response_format={"type": "json_object"},
    max_tokens=300,
)

parsed = json.loads(response.choices[0].message.content)
print(json.dumps(parsed, indent=2))
```
Output:
```json
{
  "languages": [
    { "name": "C", "year": 1972 },
    { "name": "Python", "year": 1991 },
    { "name": "JavaScript", "year": 1995 }
  ]
}
```
Valid JSON, correct schema, accurate facts. Parsed without errors. For production use, you'd want to test with more complex schemas, but the basics are solid.
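In production you would validate with jsonschema or Pydantic; for this particular response shape, even a hand-rolled check (ours, illustrative only) catches malformed output before it reaches your pipeline:

```python
def check_languages_payload(payload: dict) -> bool:
    """Minimal structural check for the schema requested above:
    {"languages": [{"name": str, "year": int}, ...]}"""
    langs = payload.get("languages")
    if not isinstance(langs, list) or not langs:
        return False
    return all(
        isinstance(item, dict)
        and isinstance(item.get("name"), str)
        and isinstance(item.get("year"), int)
        for item in langs
    )
```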
Test 6: Speed Consistency
We ran the same prompt three times to check for variance:
| Run | Tokens | Time | Speed |
|---|---|---|---|
| 1 | 308 | 1.189s | 259 tok/s |
| 2 | 262 | 1.090s | 240 tok/s |
| 3 | 286 | 0.902s | 317 tok/s |
| Average | — | — | 272 tok/s |
| Peak | — | — | 317 tok/s |
Variance of 240–317 tok/s is acceptable. Differences come from network jitter, server load, and the model using different numbers of diffusion steps depending on output complexity.
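The summary rows are straightforward to recompute from the raw runs:

```python
runs = [(308, 1.189), (262, 1.090), (286, 0.902)]  # (tokens, seconds) from the table

speeds = [tokens / seconds for tokens, seconds in runs]
average = sum(speeds) / len(speeds)
peak = max(speeds)
print(f"average {average:.0f} tok/s, peak {peak:.0f} tok/s")  # → average 272 tok/s, peak 317 tok/s
```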
The Speed Gap: Advertised vs. Measured
| Measurement | Speed | Notes |
|---|---|---|
| Inception's benchmark | 1,196 tok/s | Server-side, no network |
| Our best (streaming, generation only) | 558 tok/s | Excludes TTFT |
| Our best (non-streaming, end-to-end) | 495 tok/s | Large output |
| Multi-run average | 272 tok/s | Medium output |
| Short output | 117 tok/s | Network dominates |
We measured roughly half the advertised speed. That's not a knock on Mercury 2 — it's physics. Our tests ran from a consumer internet connection through the public API. The 1,196 tok/s figure is server-side throughput measured at the inference layer, before network overhead, TLS, HTTP framing, and Python SDK parsing eat into it.
To match their number, you'd need to benchmark from co-located infrastructure (same cloud region) or measure at the GPU layer. For what it's worth, 558 tok/s over the public internet is genuinely fast — most autoregressive models top out at 50-150 tok/s in comparable conditions.
How Does It Compare? Price & Speed
Speed only matters in context. Mercury 2 competes in the "fast and cheap" tier — models you'd use for high-volume pipelines, agents, and latency-sensitive applications, not frontier reasoning. Here's how it stacks up:
Pricing Comparison
| Model | Input $/M | Output $/M | Context | Architecture |
|---|---|---|---|---|
| Mercury 2 | $0.25 | $1.00 | 128K | Diffusion |
| DeepSeek V3 | $0.28 | $0.42 | 128K | Autoregressive (MoE) |
| GPT-4.1 Nano | $0.10 | $0.40 | 1M | Autoregressive |
| GPT-4o-mini | $0.15 | $0.60 | 128K | Autoregressive |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M | Autoregressive |
| Claude 3.5 Haiku | $0.80 | $4.00 | 200K | Autoregressive |
| GPT-4o | $2.50 | $10.00 | 128K | Autoregressive |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K | Autoregressive |
On input pricing, Mercury 2 is mid-pack — GPT-4.1 Nano and Gemini 2.0 Flash are cheaper at $0.10/M. On output, it's $1.00/M — more expensive than DeepSeek ($0.42) and GPT-4.1 Nano ($0.40), but far cheaper than Claude Haiku ($4.00) or any mid-tier model.
The real cost story is output-heavy workloads. If you're generating long responses (agents, code generation, content pipelines), output pricing dominates. At $1.00/M output, Mercury 2 costs:
- 2.5x more than GPT-4.1 Nano
- 2.5x more than Gemini 2.0 Flash
- 4x less than Claude 3.5 Haiku
- 10x less than GPT-4o
- 15x less than Claude Sonnet
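Those multiples come straight from the output prices in the table; a small lookup (prices as listed above, current as of our testing) keeps the comparison honest:

```python
OUTPUT_PRICE_PER_M = {  # USD per million output tokens, from the table above
    "mercury-2": 1.00,
    "gpt-4.1-nano": 0.40,
    "gemini-2.0-flash": 0.40,
    "claude-3.5-haiku": 4.00,
    "gpt-4o": 10.00,
    "claude-sonnet-4.6": 15.00,
}

def cost_multiple(model_a: str, model_b: str) -> float:
    """How many times more expensive model_a's output is than model_b's."""
    return OUTPUT_PRICE_PER_M[model_a] / OUTPUT_PRICE_PER_M[model_b]

print(cost_multiple("mercury-2", "gpt-4.1-nano"))      # → 2.5
print(cost_multiple("claude-3.5-haiku", "mercury-2"))  # → 4.0
```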
Speed Comparison
| Model | Approx. Speed (tok/s) | Notes |
|---|---|---|
| Mercury 2 | 495–558 (measured) | Diffusion; 1,196 server-side |
| Gemini 2.0 Flash | ~250 | Google's speed tier |
| DeepSeek V3 | ~100–160 | Varies by load |
| GPT-4o-mini | ~100–130 | OpenAI speed tier |
| GPT-4.1 Nano | ~150–200 | OpenAI's fastest |
| Claude 3.5 Haiku | ~80–100 | Anthropic speed tier |
| GPT-4o | ~60–90 | Mid-tier |
| Claude Sonnet 4.6 | ~70–80 | Mid-tier |
Speed figures are approximate client-side measurements and vary by network, region, and load. Mercury 2 figures are from our testing.
Even through the public internet, Mercury 2 is 2-3x faster than the next fastest competitor (Gemini Flash at ~250 tok/s) and 5-7x faster than mid-tier models like GPT-4o and Claude Sonnet. This is where the diffusion architecture genuinely shines — it's not marketing fluff.
Cost per Million Output Tokens at Speed
A useful way to think about it: what do you pay per million output tokens, and how fast do you get them?
| Model | Output $/M | Speed (tok/s) | Time for 1M tokens | Cost per hour of output |
|---|---|---|---|---|
| Mercury 2 | $1.00 | 550 | ~30 min | ~$2.00 |
| GPT-4.1 Nano | $0.40 | 175 | ~95 min | ~$0.25 |
| DeepSeek V3 | $0.42 | 130 | ~128 min | ~$0.20 |
| Gemini 2.0 Flash | $0.40 | 250 | ~67 min | ~$0.36 |
| Claude 3.5 Haiku | $4.00 | 90 | ~185 min | ~$1.30 |
Mercury 2 isn't the cheapest per token, but it delivers those tokens fastest. If your bottleneck is latency — how quickly you can complete an agentic loop, respond to a user, or process a document — Mercury 2 wins decisively. If your bottleneck is pure cost and you can tolerate slower speeds, DeepSeek V3 or GPT-4.1 Nano are cheaper.
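The last two columns of the table follow from the first two, assuming sustained generation at the quoted speed:

```python
def time_for_tokens_min(tokens: int, tok_per_s: float) -> float:
    """Minutes to generate a given number of tokens at a sustained rate."""
    return tokens / tok_per_s / 60

def cost_per_hour(output_price_per_m: float, tok_per_s: float) -> float:
    """Dollars spent per hour of continuous generation at a sustained rate."""
    tokens_per_hour = tok_per_s * 3600
    return tokens_per_hour / 1_000_000 * output_price_per_m

# Mercury 2 at $1.00/M output and ~550 tok/s:
print(round(time_for_tokens_min(1_000_000, 550)))  # → 30
print(round(cost_per_hour(1.00, 550), 2))          # → 1.98
```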
Beyond Speed: Extended Testing
We ran a second test suite covering reasoning, multi-language, multi-turn conversation, agentic tool chains, needle-in-a-haystack retrieval, and edge cases. The results surfaced both strengths and a critical quirk.
The max_tokens Trap
The most important practical finding: Mercury 2 needs generous max_tokens values or it returns empty responses.
With autoregressive models, setting max_tokens=20 means "generate up to 20 tokens, stop when you're done." The model emits tokens one by one and stops early if it finishes. Mercury 2's diffusion architecture works differently — it appears to allocate the full output buffer upfront. If that buffer is too small, the model produces empty content with finish_reason=length and tokens=0:
```python
# This fails silently — returns an empty string
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=10,  # too low for a diffusion model
)
print(response.choices[0].message.content)  # ""

# This works — give it room
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=150,  # generous headroom
)
print(response.choices[0].message.content)  # "4"
```
Rule of thumb: always set max_tokens to at least 100–200, even if you expect a short answer. The model will still stop early (finish_reason=stop) when it's done — you won't waste tokens. But if you set it too low, you get nothing. This is a significant difference from autoregressive models and will bite you in production if you're migrating existing code.
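A defensive wrapper is the easiest migration aid. This sketch (our own pattern, not an official SDK feature) enforces a token floor and retries once with a larger budget when it detects the empty-response signature:

```python
def safe_complete(client, model, messages, max_tokens=200,
                  floor=200, retry_factor=4, **kwargs):
    """Call chat.completions.create with diffusion-safe max_tokens.

    Enforces a minimum token budget and, if the response comes back empty
    with finish_reason == "length", retries once with a larger budget.
    Our own defensive pattern for the quirk described above.
    """
    budget = max(max_tokens, floor)
    response = client.chat.completions.create(
        model=model, messages=messages, max_tokens=budget, **kwargs
    )
    choice = response.choices[0]
    if not choice.message.content and choice.finish_reason == "length":
        response = client.chat.completions.create(
            model=model, messages=messages, max_tokens=budget * retry_factor, **kwargs
        )
    return response
```

Because the model stops early with `finish_reason=stop` when done, the higher budget costs nothing on well-behaved responses.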
Quality & Reasoning: 5/6
| Test | Result | Details |
|---|---|---|
| Arithmetic (17×23+14−5) | PASS | Returned 400 correctly |
| Word problem (45−12+8) | PASS | Returned 41 correctly |
| Logic (invalid syllogism) | FAIL | Empty response (max_tokens issue) |
| Code generation (fibonacci) | PASS | Clean Python function, 182 chars |
| Instruction following (3 bullets) | PASS | Exactly 3 dash-prefixed bullets |
| Factual recall (capital of Australia) | PASS | Canberra |
Reasoning is solid. The one failure was the logic test returning empty — likely another max_tokens casualty at 80 tokens, not a reasoning failure. Math, code, instructions, and factual recall all pass cleanly.
Multi-language: 3/3
| Language | Prompt | Response |
|---|---|---|
| Serbian | "Koji je glavni grad Srbije?" | Beograd |
| German | "Was ist die Hauptstadt von Deutschland?" | Berlin |
| Japanese | "日本の首都はどこですか?" | 東京 |
Mercury 2 handles non-English prompts correctly — including Cyrillic-adjacent and CJK languages. Responses are accurate and concise.
Multi-turn Conversation: 3/3
We tested whether the model maintains context across turns:
```python
messages = [
    {"role": "system", "content": "You are a helpful assistant. Be concise."},
    {"role": "user", "content": "My name is Marko and I live in Novi Sad."},
]
# ... assistant responds ...

messages.append({"role": "user", "content": "What is my name?"})
# → "Your name is Marko."

messages.append({"role": "user", "content": "Where do I live?"})
# → "You live in Novi Sad."
```
Both facts recalled correctly. We also tested persona consistency by assigning a pirate persona — Mercury 2 committed fully ("Arr, matey! Gather 'round the galley o' knowledge...") with 7 pirate-themed words in a single response.
Agentic Tool Chains: 4/4
This was the most impressive result. We defined three tools (search_product, get_price, add_to_cart) and asked Mercury 2 to find a blue t-shirt and add it to a cart:
```
Step 1: User asks "Find me a blue t-shirt and add it to my cart."
        → Model calls search_product(query="blue t-shirt") ✓
Step 2: We return search results with SKU-1234
        → Model calls add_to_cart(product_id="SKU-1234") ✓
Step 3: We confirm the cart addition
        → Model responds: "Your blue t-shirt has been added to your cart.
          Let me know if you'd like anything else." ✓
Step 4: We return a tool error ("Service temporarily unavailable")
        → Model retries the tool call ✓
```
Four steps, four correct decisions. The model understood the task, chained tools in the right order, confirmed success in natural language, and recovered from an error by retrying. This validates Inception's pitch that Mercury 2 is built for agentic workloads.
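The loop we drove by hand generalizes to a few lines. This is our own sketch of a generic tool loop — the `impls` mapping of tool names to local callables is our convention, not part of the API:

```python
import json

def run_tool_loop(client, model, user_prompt, tools, impls, max_steps=6):
    """Minimal agent loop (our sketch, not Inception's code): call the model,
    execute any requested tools via `impls`, feed results back, and stop
    when the model answers in plain text."""
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model=model, messages=messages, tools=tools, max_tokens=500
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final natural-language answer
        messages.append(msg)  # the OpenAI SDK accepts message objects here
        for tc in msg.tool_calls:
            result = impls[tc.function.name](**json.loads(tc.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": json.dumps(result),
            })
    raise RuntimeError("tool loop did not finish within max_steps")
```

Production code would also handle malformed arguments and tool errors; Mercury 2's retry behavior in Step 4 suggests it cooperates with error results fed back as tool messages.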
Needle in a Haystack: 3/3
We hid the string MERCURY-FAST-7742 inside ~4,000 tokens of filler text at three positions:
| Position | Found? |
|---|---|
| Beginning | MERCURY-FAST-7742 |
| Middle | MERCURY-FAST-7742 |
| End | MERCURY-FAST-7742 |
Perfect retrieval at all positions. The 128K context window handles information retrieval correctly — at least at the ~4K scale we tested.
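The harness itself is simple; our filler construction looked roughly like this (a sketch, with an illustrative filler sentence):

```python
def build_haystack(needle, filler_sentence, n_filler, position):
    """Embed a needle at the beginning, middle, or end of repeated filler text."""
    filler = [filler_sentence] * n_filler
    index = {"beginning": 0, "middle": n_filler // 2, "end": n_filler}[position]
    parts = filler[:index] + [needle] + filler[index:]
    return " ".join(parts)

prompt = build_haystack(
    "MERCURY-FAST-7742",
    "The quarterly report noted steady progress across all departments.",
    400,  # roughly 4,000 tokens of filler at ~10 tokens per sentence
    "middle",
)
```

The model was then asked to return the hidden string verbatim.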
What Didn't Work
Concurrency returned 0 tokens across all 20 requests — the max_tokens=20 setting caused every parallel request to fail silently. This needs retesting with proper headroom.
Temperature sampling also produced empty responses at max_tokens=10. We couldn't evaluate whether the diffusion model's temperature parameter behaves differently from autoregressive models — an open question for future testing.
System prompt adherence with very constrained instructions ("respond in exactly 3 words") returned empty at max_tokens=20. With adequate headroom, Mercury 2 follows system prompts well (the pirate test proves this), but the tight token budget prevented evaluation of precise constraint-following.
Extended Test Summary
| Suite | Score | Verdict |
|---|---|---|
| Reasoning | 5/6 | Strong — math, code, facts, instructions |
| Multi-language | 3/3 | Serbian, German, Japanese all correct |
| Multi-turn | 3/3 | Memory and persona consistency |
| Agentic Loops | 4/4 | Multi-step tool chains + error recovery |
| Needle-in-Haystack | 3/3 | Perfect retrieval at all positions |
| Edge Cases | 3/5 | JSON and stop sequences work; system prompt tests need more headroom |
| Concurrency | — | Inconclusive (max_tokens issue) |
| Sampling | — | Inconclusive (max_tokens issue) |
| Total | 21/24 | — |
The Bottom Line
Mercury 2 is real and it works. The OpenAI compatibility is seamless — you can swap it into an existing codebase in under a minute. Tool use, streaming, JSON mode, multi-language, multi-turn, and agentic tool chains all function correctly. It scored 21/24 on our extended test suite, with the three failures all traceable to a single cause: the diffusion model's sensitivity to low max_tokens values.
That max_tokens quirk is the one thing you must know before deploying. Set it generously (200+) even for short expected outputs, or you'll get silent empty responses. It's not a bug per se — it's how diffusion generation works — but it will catch anyone migrating from autoregressive models.
The speed advantage is genuine, though tempered by network reality. You won't see 1,196 tok/s from your laptop, but 400-550 tok/s is still 2-3x faster than the next fastest alternative. The agentic capabilities are particularly strong — multi-step tool chains with error recovery worked flawlessly, validating Inception's core pitch.
It's not the cheapest model per token (GPT-4.1 Nano and DeepSeek V3 undercut it on output pricing), and it's not the smartest (frontier models like Claude Sonnet or GPT-4o have deeper reasoning). But in the speed-to-cost ratio for production workloads, Mercury 2 occupies a unique position — and as the first commercial diffusion LLM, it represents a genuine architectural bet that the rest of the industry is watching.
Specs at a Glance:

| Spec | Value |
|---|---|
| Model | mercury-2 |
| Architecture | Diffusion LLM (dLLM) |
| Context window | 128K tokens |
| Max completion | 16,384 tokens |
| Input pricing | $0.25/M tokens |
| Output pricing | $1.00/M tokens |
| API compatibility | OpenAI-compatible |
| Measured throughput | 495–558 tok/s (client-side) |