
Prompt Processing vs Token Generation: the Two Speeds of an LLM

ai.rs Jun 14, 2026

Every LLM has two speeds, and confusing them is one of the most common reasons people misread benchmarks and buy the wrong hardware. When a model answers you it does two very different jobs: first it reads your prompt (Prompt Processing, or PP, also called prefill), then it writes the answer one token at a time (Token Generation, or TG, also called decode). The two phases hit different bottlenecks, report wildly different tokens-per-second figures, and are sped up by different things.

Two phases, one request

  • PP / prefill — the model ingests your entire prompt at once. All the prompt tokens are pushed through the network together, in a few large matrix multiplications. This phase is compute-bound: it is limited by the GPU's raw math throughput (FLOPs).
  • TG / decode — the model then generates the answer one token at a time, each new token depending on every token before it. This is sequential and memory-bandwidth-bound: producing each token means streaming the model's weights (and the ever-growing KV cache) out of memory. (If that part is fuzzy, read The KV Cache first.)

| | Prompt Processing (PP) | Token Generation (TG) |
|---|---|---|
| Also called | prefill | decode |
| What it does | reads your prompt | writes the answer |
| Pattern | all tokens at once (parallel) | one token at a time (sequential) |
| Bottleneck | compute (FLOPs) | memory bandwidth |
| Typical speed | thousands of tok/s | tens–hundreds of tok/s |
| You feel it as | time to first token (TTFT) | how fast the text streams |

Why the two numbers are so far apart

PP can chew through a 2,000-token prompt in one parallel sweep, so 2,000+ tok/s is normal. TG has to run a full forward pass per output token, and each pass re-reads the model's weights (plus the KV cache built up so far) from memory, so even a fast GPU manages only tens to a few hundred tok/s on a single stream.
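
A back-of-the-envelope sketch makes the gap concrete. It assumes the common rule of thumb of roughly 2 FLOPs per parameter per token for a forward pass, and that decode reads every weight once per generated token. It ignores KV-cache traffic, attention cost, and real-world overhead, so treat the results as ceilings, not predictions. The device and model numbers are hypothetical, picked only for illustration.

```python
def prefill_ceiling_tok_s(peak_flops: float, n_params: float) -> float:
    """Compute-bound ceiling, assuming ~2 FLOPs per parameter per token."""
    return peak_flops / (2.0 * n_params)


def decode_ceiling_tok_s(mem_bandwidth_bytes_s: float, n_params: float,
                         bytes_per_param: float) -> float:
    """Bandwidth-bound ceiling: every weight is read once per generated token."""
    return mem_bandwidth_bytes_s / (n_params * bytes_per_param)


# Hypothetical device and model (illustrative values, not a real product).
PEAK_FLOPS = 100e12      # 100 TFLOP/s of usable math throughput
MEM_BANDWIDTH = 1e12     # 1 TB/s of memory bandwidth
N_PARAMS = 8e9           # dense model with 8 billion parameters
BYTES_PER_PARAM = 2.0    # 16-bit weights

pp = prefill_ceiling_tok_s(PEAK_FLOPS, N_PARAMS)
tg = decode_ceiling_tok_s(MEM_BANDWIDTH, N_PARAMS, BYTES_PER_PARAM)
print(f"prefill ceiling ~{pp:,.0f} tok/s, decode ceiling ~{tg:,.1f} tok/s")
```

With these made-up inputs the two ceilings land about two orders of magnitude apart, which is the same shape as the real benchmark lines you will see.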

That is why a benchmark line like "2,053 tok/s prefill / 49.7 tok/s decode" is not a contradiction — it is PP versus TG. The big number is how fast the machine reads; the small number is how fast it writes, and the small one is what you stare at while the answer streams in.

Which one actually matters for you

  • Long prompts — big documents, long chat histories, RAG context — make PP the cost. You wait longer for the first token (higher TTFT).
  • Long answers — code generation, essays, agent loops — make TG the cost. The model feels slow as it types.
  • A rough mental model for a request: total time ≈ TTFT + N × (1 / TG) for an N-token answer, where TTFT is roughly the prompt length divided by the PP speed.
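
The mental model above can be sketched in a few lines. The speeds are the ones from the benchmark line quoted earlier (2,053 tok/s prefill, 49.7 tok/s decode); the two workloads are hypothetical. Treating TTFT as prompt tokens divided by PP speed is an assumption that ignores queueing, network latency, and any slowdown as the context grows.

```python
def request_time_s(prompt_tokens: int, answer_tokens: int,
                   pp_tok_s: float, tg_tok_s: float) -> tuple:
    """Return (ttft, total) in seconds using total = TTFT + N * (1 / TG)."""
    ttft = prompt_tokens / pp_tok_s            # prefill: time to first token
    total = ttft + answer_tokens / tg_tok_s    # decode: one token at a time
    return ttft, total


PP_TOK_S = 2053.0   # prefill speed from the benchmark line in the article
TG_TOK_S = 49.7     # decode speed from the same line

# Hypothetical workloads: (label, prompt tokens, answer tokens).
workloads = [
    ("long prompt, short answer", 8000, 100),
    ("short prompt, long answer", 200, 1500),
]

for label, prompt_tokens, answer_tokens in workloads:
    ttft, total = request_time_s(prompt_tokens, answer_tokens, PP_TOK_S, TG_TOK_S)
    print(f"{label}: TTFT {ttft:.2f} s, total {total:.2f} s")
```

On the first workload most of the wait happens before the first token appears; on the second, almost all of it is spent streaming the answer.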

What speeds up each phase

They respond to different levers — another reason to keep them separate:

  • PP (compute-bound): more FLOPs, lower-precision math (FP8/FP4), and the biggest win of all, prompt caching: reuse the prefill of a repeated prefix (a system prompt, a long document) and skip PP for that prefix on the next call. Only the new tokens after the prefix still need to be processed.
  • TG (bandwidth-bound): faster memory bandwidth, smaller weights (quantization), fewer active parameters (MoE), and KV-cache-friendly attention (MLA / GQA). Batching raises total throughput across users but does not make a single stream's TG faster.
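
Here is a rough sketch of one lever per phase. For PP it estimates TTFT with and without a cached prefix; for TG it estimates the best-case speedup from shrinking the weights. Both functions are simplifications: the first assumes a cache hit costs nothing, the second assumes decode is purely bandwidth-bound and that quantization adds no overhead. The token counts are hypothetical, and the prefill speed is borrowed from the benchmark line quoted earlier.

```python
def ttft_s(prompt_tokens: int, cached_prefix_tokens: int, pp_tok_s: float) -> float:
    """Prefill time when the first `cached_prefix_tokens` are already cached."""
    if not 0 <= cached_prefix_tokens <= prompt_tokens:
        raise ValueError("cached prefix must be between 0 and the prompt length")
    return (prompt_tokens - cached_prefix_tokens) / pp_tok_s


def decode_speedup_from_quantization(old_bits: float, new_bits: float) -> float:
    """Best-case TG speedup if decode time scales with the bytes of weights read."""
    return old_bits / new_bits


PP_TOK_S = 2053.0      # prefill speed from the benchmark line in the article
PREFIX_TOKENS = 6000   # hypothetical system prompt plus document
NEW_TOKENS = 200       # hypothetical new user message

cold = ttft_s(PREFIX_TOKENS + NEW_TOKENS, 0, PP_TOK_S)
warm = ttft_s(PREFIX_TOKENS + NEW_TOKENS, PREFIX_TOKENS, PP_TOK_S)
print(f"TTFT cold {cold:.2f} s, with cached prefix {warm:.2f} s")
print(f"16-bit to 4-bit weights: up to {decode_speedup_from_quantization(16, 4):.0f}x TG")
```

Note that neither lever helps the other phase much: caching the prefix does nothing for streaming speed, and smaller weights do little for a compute-bound prefill.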

Why this decides your hardware

Here is the practical punchline. Different chips win different phases. A high-bandwidth GPU screams at TG — for models that fit its memory. A big-memory, lower-bandwidth box can hold far larger models, but its TG is throttled by bandwidth while its PP leans on (often more modest) compute. Look at only one number and you will buy the wrong machine — which is exactly what we dig into in AI Workstation Comparison: RTX 5090 vs GB10 (HP ZGX).
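
To see the trade-off in numbers, here is a minimal sketch comparing two hypothetical machines. Their specs are round illustrative values, not measurements of any real product. It assumes a model "fits" if its weights are no larger than the machine's memory (ignoring KV cache and activations) and that decode speed is capped at bandwidth divided by weight size.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Machine:
    name: str
    memory_gb: float         # memory available for model weights
    bandwidth_gb_s: float    # memory bandwidth


def decode_ceiling_tok_s(machine: Machine, weights_gb: float) -> Optional[float]:
    """Bandwidth-bound TG ceiling, or None if the weights do not fit."""
    if weights_gb > machine.memory_gb:
        return None
    return machine.bandwidth_gb_s / weights_gb


# Hypothetical machines (illustrative specs only).
machines = [
    Machine("high-bandwidth GPU", memory_gb=32, bandwidth_gb_s=1800),
    Machine("big-memory box", memory_gb=128, bandwidth_gb_s=270),
]

for weights_gb in (16, 64):   # hypothetical model sizes in GB of weights
    for m in machines:
        ceiling = decode_ceiling_tok_s(m, weights_gb)
        result = "does not fit" if ceiling is None else f"~{ceiling:.1f} tok/s ceiling"
        print(f"{weights_gb} GB model on {m.name}: {result}")
```

The small model runs far faster on the high-bandwidth machine, while the large one only runs at all on the big-memory machine, and slowly. A fuller comparison would add the PP side using each machine's compute throughput.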

Bottom line

  • An LLM has two speeds: PP (prefill) reads the prompt; TG (decode) writes the answer.
  • PP is compute-bound and fast (thousands of tok/s); it sets your time-to-first-token.
  • TG is bandwidth-bound and slow (tens to hundreds of tok/s); it sets your streaming speed.
  • Long prompt → PP-heavy; long answer → TG-heavy.
  • Always read both numbers. A model — or a GPU — can be excellent at one and poor at the other.
