
Prompt Processing vs Token Generation: the Two Speeds of an LLM

ai.rs Jun 14, 2026

inference prefill decode ttft fundamentals

Every LLM has two speeds, and confusing them is the single biggest reason people misread benchmarks and buy the wrong hardware. When a model answers you it does two very different jobs: first it reads your prompt — Prompt Processing (PP, also called prefill) — then it writes the answer one token at a time — Token Generation (TG, also called decode). They run into different bottlenecks, report wildly different tokens-per-second, and are sped up by different things.

Two phases, one request

PP / prefill: the model ingests your entire prompt at once. All the prompt tokens are pushed through the network together as large matrix multiplications, so each weight is read from memory once and reused across every prompt token. This phase is compute-bound: it is limited by the GPU's raw math throughput (FLOPs).
TG / decode — the model then generates the answer one token at a time, each new token depending on every token before it. This is sequential and memory-bandwidth-bound: producing each token means streaming the model's weights (and the ever-growing KV cache) out of memory. (If that part is fuzzy, read The KV Cache first.)

	Prompt Processing (PP)	Token Generation (TG)
Also called	prefill	decode
What it does	reads your prompt	writes the answer
Pattern	all tokens at once (parallel)	one token at a time (sequential)
Bottleneck	compute (FLOPs)	memory bandwidth
Typical speed	thousands of tok/s	tens–hundreds of tok/s
You feel it as	time to first token (TTFT)	how fast the text streams
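
One way to see why the two phases hit different limits is to count how much math is done per byte of weights read. The sketch below is a rough roofline-style estimate, not a measurement. It assumes about 2 FLOPs per parameter per token, ignores attention and KV-cache traffic, and the function name and the FP16 / 2,000-token figures are illustrative assumptions.

```python
def flops_per_weight_byte(tokens_per_pass: int, bytes_per_param: float) -> float:
    """Rough arithmetic intensity of one forward pass.

    Assumes ~2 FLOPs (one multiply, one add) per parameter per token and
    that the weights are read from memory once per pass. Attention and
    KV-cache traffic are ignored, so treat the result as an estimate.
    The parameter count cancels out, so it is not an argument.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


# Prefill: a 2,000-token prompt shares one read of the weights (FP16 = 2 bytes).
prefill_intensity = flops_per_weight_byte(tokens_per_pass=2000, bytes_per_param=2.0)

# Decode: each pass produces a single token, so the same read does far less math.
decode_intensity = flops_per_weight_byte(tokens_per_pass=1, bytes_per_param=2.0)

print(f"prefill: {prefill_intensity:.0f} FLOPs per weight byte")
print(f"decode:  {decode_intensity:.0f} FLOPs per weight byte")
```

Under these assumptions prefill does roughly 2,000 times more math per byte moved than decode, which is why prefill tends to saturate the compute units while decode mostly waits on memory.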

Why the two numbers are so far apart

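PP can chew through a 2,000-token prompt in one parallel sweep, reading the weights once for the whole batch of tokens, so 2,000+ tok/s is normal. TG has to run a full forward pass per output token, and each pass re-reads the model's weights (all of them for a dense model, only the active experts for MoE) plus the KV cache from memory, so even a fast GPU manages only tens to a few hundred tok/s.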
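
The bandwidth limit on decode can be turned into a back-of-the-envelope ceiling. This is a minimal sketch, assuming every generated token requires one full read of the weights and nothing else. The function name and the example machine (1,000 GB/s, 16 GB of weights) are hypothetical and not taken from any benchmark.

```python
def decode_tok_per_s_ceiling(bandwidth_gb_per_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed.

    Assumes each generated token needs one full read of the weights, so the
    ceiling is bandwidth divided by weight size. KV-cache reads and kernel
    overheads push real numbers below this value.
    """
    if bandwidth_gb_per_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_per_s / weights_gb


# Hypothetical machine: 1,000 GB/s of memory bandwidth, 16 GB of weights.
ceiling = decode_tok_per_s_ceiling(bandwidth_gb_per_s=1000.0, weights_gb=16.0)
print(f"decode ceiling: {ceiling:.1f} tok/s")  # 62.5 tok/s
```

Halving the bytes read per token, for example by quantizing the weights, doubles this ceiling, while adding compute does not move it.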

That is why a benchmark line like "2,053 tok/s prefill / 49.7 tok/s decode" is not a contradiction — it is PP versus TG. The big number is how fast the machine reads; the small number is how fast it writes, and the small one is what you stare at while the answer streams in.

Which one actually matters for you

Long prompts — big documents, long chat histories, RAG context — make PP the cost. You wait longer for the first token (higher TTFT).
Long answers — code generation, essays, agent loops — make TG the cost. The model feels slow as it types.
A rough mental model for a request: total time ≈ TTFT + N / TG for an N-token answer, where TTFT ≈ P / PP for a P-token prompt.
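
The mental model above can be sketched in a few lines. This is an estimate under stated assumptions: TTFT is taken to be prompt tokens divided by PP speed (ignoring queueing and network latency), and the speeds are the 2,053 / 49.7 tok/s figures from the benchmark line quoted earlier. The function name and the two workloads are illustrative.

```python
def request_seconds(prompt_tokens: int, answer_tokens: int,
                    pp_tok_per_s: float, tg_tok_per_s: float):
    """Return (ttft, streaming time, total) in seconds for one request.

    Assumes TTFT ~= prompt_tokens / PP speed and a constant TG speed.
    """
    ttft = prompt_tokens / pp_tok_per_s
    stream = answer_tokens / tg_tok_per_s
    return ttft, stream, ttft + stream


PP, TG = 2053.0, 49.7  # tok/s, from the benchmark line in the article

# Long prompt, short answer: the wait before the first token is a large share.
ttft_a, stream_a, total_a = request_seconds(8000, 200, PP, TG)

# Short prompt, long answer: almost all of the time is spent streaming.
ttft_b, stream_b, total_b = request_seconds(500, 2000, PP, TG)

print(f"long prompt: TTFT {ttft_a:.1f}s + stream {stream_a:.1f}s = {total_a:.1f}s")
print(f"long answer: TTFT {ttft_b:.1f}s + stream {stream_b:.1f}s = {total_b:.1f}s")
```

With these numbers an 8,000-token prompt costs about as much time as a 200-token answer, which shows how much cheaper each prompt token is than each generated token.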

What speeds up each phase

They respond to different levers — another reason to keep them separate:

PP (compute-bound): more FLOPs, lower-precision math (FP8/FP4), and the biggest win of all, prompt caching: reuse the prefill of a repeated prefix (a system prompt, a long document) and skip PP for that prefix on the next call, so only the new tokens still have to be processed.
TG (bandwidth-bound): faster memory bandwidth, smaller weights (quantization), fewer active parameters (MoE), and KV-cache-friendly attention (MLA / GQA). Batching raises total throughput across users but does not make a single stream's TG faster.
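
To illustrate the prompt-caching lever, here is a minimal sketch of how a cached prefix changes time to first token. It assumes TTFT scales linearly with the number of uncached prompt tokens and that a cache hit costs nothing. The function name, the 2,000 tok/s PP speed, and the token counts are hypothetical.

```python
def ttft_seconds(prompt_tokens: int, pp_tok_per_s: float,
                 cached_prefix_tokens: int = 0) -> float:
    """Estimate time to first token when part of the prompt is already cached.

    Only the tokens after the cached prefix go through prefill. A cache
    lookup is treated as free, which is an idealization.
    """
    if not 0 <= cached_prefix_tokens <= prompt_tokens:
        raise ValueError("cached prefix must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_prefix_tokens
    return uncached / pp_tok_per_s


# Hypothetical request: 10,000-token prompt, 9,500 of which is a reused document.
cold = ttft_seconds(10_000, pp_tok_per_s=2000.0)
warm = ttft_seconds(10_000, pp_tok_per_s=2000.0, cached_prefix_tokens=9_500)

print(f"cold TTFT: {cold:.2f}s, warm TTFT: {warm:.2f}s")  # 5.00s vs 0.25s
```

The saving applies only to the shared prefix, so it pays off most when a long system prompt or document is reused across many calls.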

Why this decides your hardware

Here is the practical punchline. Different chips win different phases. A high-bandwidth GPU screams at TG — for models that fit its memory. A big-memory, lower-bandwidth box can hold far larger models, but its TG is throttled by bandwidth while its PP leans on (often more modest) compute. Look at only one number and you will buy the wrong machine — which is exactly what we dig into in AI Workstation Comparison: RTX 5090 vs GB10 (HP ZGX).

Bottom line

An LLM has two speeds: PP (prefill) reads the prompt; TG (decode) writes the answer.
PP is compute-bound and fast (thousands of tok/s); it sets your time to first token.
TG is bandwidth-bound and slow (tens to hundreds of tok/s); it sets your streaming speed.
Long prompt → PP-heavy; long answer → TG-heavy.
Always read both numbers. A model — or a GPU — can be excellent at one and poor at the other.

