
The KV Cache: the trick that makes LLMs fast — and slow

ai.rs Jun 10, 2026

If you have ever pasted a long document into a chatbot and watched it crawl, or seen it refuse outright with an "out of memory" error, you have run into one of the most important and least-discussed pieces of machinery inside a large language model: the KV cache. It is the optimization that makes text generation practical at all — and, at the same time, the reason long contexts get slow and expensive. This article is about that tension: what the KV cache is, why it is necessary, how it helps, and why it turns into a problem as the context grows.

No solutions here — just a clear look at the problem.

How a language model writes a sentence

A large language model writes one token (roughly, one word-piece) at a time. To choose the next token, it looks back at everything written so far — the prompt plus what it has already generated — and decides what should come next.

The looking-back is done by attention. For each new token, the model forms a query ("what am I looking for right now?") and compares it against a key for every previous token ("what does each earlier token offer?"). The comparison produces a set of weights, and those weights are used to mix together a value from each previous token into a summary that informs the next word. Query, Key, Value — Q, K, V.
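As a rough illustration of the query/key/value mechanics described above, here is a minimal sketch of attention for one query in plain Python. The vectors are made-up toy values, and the softmax with 1/sqrt(d) scaling is the standard scaled dot-product form, assumed here rather than taken from any particular model.

```python
import math


def attend(query, keys, values):
    """Attention for a single query over every previous token."""
    # Compare the query against each key (scaled dot product).
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]

    # Softmax turns the scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Mix the values together using those weights.
    dim = len(values[0])
    summary = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, summary


# Toy data (hypothetical): tokens 0 and 2 "offer" what the query is looking for.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [10.0, 0.0]]

weights, summary = attend(query, keys, values)
```

Tokens whose keys match the query receive larger weights, so their values dominate the summary.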

The crucial detail: this happens independently in every layer of the network (modern models have dozens), and the model must do it for every single new token it generates.

The naive way is absurdly wasteful

Here is the catch. To generate the 1,001st token, the model attends over the keys and values of the previous 1,000 tokens. To generate the 1,002nd, it attends over 1,001. And so on.

If you implemented this naively, each new token would mean re-running the entire sequence through the whole network to regenerate all those keys and values from scratch. Generating a 1,000-token answer would mean roughly a thousand full passes over a growing sequence, an enormous amount of repeated work. Since each pass also runs attention between every pair of positions, the attention cost alone grows roughly with the cube of the length. It would be like re-reading an entire book from page one every time you wanted to recall a single fact.
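To put rough numbers on that waste, the sketch below counts work under a deliberately simplified model: one unit per token position pushed through the network, and one unit per query/key comparison in a causal attention pass. The function names are illustrative, not from any library.

```python
def naive_positions(prompt_len, new_tokens):
    """Token positions processed when every step re-runs the whole sequence."""
    # Step s re-runs the prompt plus the s tokens generated so far.
    return sum(prompt_len + s for s in range(new_tokens))


def naive_attention_pairs(prompt_len, new_tokens):
    """Query/key comparisons across all of those repeated passes."""
    total = 0
    for s in range(new_tokens):
        n = prompt_len + s
        # A causal pass over n positions compares n * (n + 1) / 2 pairs.
        total += n * (n + 1) // 2
    return total


def unique_positions(prompt_len, new_tokens):
    """Positions that genuinely need processing once each."""
    # The final generated token is never fed back into the model.
    return prompt_len + new_tokens - 1


repeated = naive_positions(1000, 1000)   # 1,499,500 positions
needed = unique_positions(1000, 1000)    # 1,999 positions
```

Under these assumptions, a 1,000-token answer to a 1,000-token prompt processes about 750 times more positions than strictly needed, and doubling both lengths multiplies the attention comparisons by about eight.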

And almost all of that work is redundant, because of one quiet but powerful fact:

The key and value of a token never change once it has been written.

Token #37's key and value depend only on token #37 and the tokens before it — none of which change as the model keeps generating. So there is no reason to ever compute them twice.

Enter the KV cache

The KV cache is exactly the obvious fix: compute each token's keys and values once, then store them.

Now generating a new token is cheap. The model:

  1. computes Q, K, V for the single new token,
  2. appends that token's K and V to the cache,
  3. attends its query against all the cached keys and values.

No re-running the past. The keys and values of every earlier token are sitting in memory, ready to be read. (Queries, by contrast, are not cached — there is always exactly one "current" query, and once a token is generated its query is never needed again. Only K and V accumulate.)
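The three steps above can be sketched with a toy single-layer, single-head model in plain Python. Everything here is an assumption made for illustration: the 2x2 projection weights, the token vectors, and the `KVCache` class are hypothetical, and a real model keeps one such cache per layer. The point is only that the cached path gives the same result as recomputing every key and value from scratch.

```python
import math

# Hypothetical 2x2 projection weights for a toy single-layer, single-head model.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.7]]
W_V = [[1.0, 2.0], [3.0, 4.0]]


def project(x, weights):
    """Multiply a vector by a weight matrix (one output per row)."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weights]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over stored keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [
        sum((e / total) * value[i] for e, value in zip(exps, values))
        for i in range(len(values[0]))
    ]


class KVCache:
    """Keys and values accumulate; queries are used once and discarded."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x):
        q = project(x, W_Q)                        # 1. Q, K, V for the new token only
        self.keys.append(project(x, W_K))          # 2. append K to the cache
        self.values.append(project(x, W_V))        #    ... and V
        return attend(q, self.keys, self.values)   # 3. attend over everything cached


def recompute_step(tokens):
    """Naive baseline: rebuild every key and value for the whole sequence."""
    keys = [project(t, W_K) for t in tokens]
    values = [project(t, W_V) for t in tokens]
    return attend(project(tokens[-1], W_Q), keys, values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
cache = KVCache()
cached_outputs = [cache.step(x) for x in tokens]
naive_outputs = [recompute_step(tokens[: i + 1]) for i in range(len(tokens))]
```

Both paths produce matching outputs at every step, but the cached path projects each token exactly once.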

This single trick is what turns autoregressive generation from a theoretical curiosity into something that runs in real time. Without it, long conversations would be impossibly slow. With it, the model does a small, fixed amount of new work per token plus a quick read over its notes.

It really is like taking notes while you read: instead of re-reading the whole book to answer each question, you jot down what matters and glance back at your notes. Fast, sensible, obviously the right thing to do.

But the notes have to live somewhere

Here is where the trouble begins. Those notes — the cached keys and values — have to be stored, in the GPU's fast memory, right next to the model's weights. And the pile grows with every token.

How big is the pile? For every token, the cache must hold a key and a value, for every layer, for every attention head that has its own keys. Concretely, for a typical 8-billion-parameter model (dozens of layers, a handful of key/value heads, on the order of a hundred numbers per head, stored as 2-byte values), the cache costs roughly 150 kilobytes per token.

That sounds tiny. Multiply it out:

  • a 10,000-token context → ~1.5 GB,
  • a 100,000-token context → ~15 GB,
  • and that is per conversation. Serving ten users at once multiplies it by ten.

For comparison, the model's weights might be ~16 GB and fixed. The cache is on top of that, and unlike the weights, it grows without bound as the context gets longer.

To make it concrete, here is the full KV cache for a handful of popular open-source models, at four context lengths. The per-token cost is fixed by the architecture (layers × key/value heads × head size × 2 for K-and-V × 2 bytes for bfloat16); the totals are just that, multiplied by the number of tokens — per conversation:

| Model | KV cache per token | At 8K | At 32K | At 128K | At 1M |
|---|---|---|---|---|---|
| Llama 3.2 1B | 32 KB | 0.25 GB | 1.0 GB | 4 GB | 32 GB |
| Llama 3.1 8B | 128 KB | 1.0 GB | 4 GB | 16 GB | 128 GB |
| Qwen3 8B | 144 KB | 1.1 GB | 4.5 GB | 18 GB | 144 GB |
| Qwen2.5 32B | 256 KB | 2.0 GB | 8 GB | 32 GB | 256 GB |
| Llama 3.1 70B | 320 KB | 2.5 GB | 10 GB | 40 GB | 320 GB |

(bfloat16; one sequence. The per-token cost is set by the architecture — e.g. Llama 3.1 8B is 32 layers × 8 key/value heads × 128 numbers per head × 2 (K and V) × 2 bytes. An 8-bit or 4-bit cache halves or quarters these totals; serving N users at once multiplies them by N.)
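The per-token formula is simple enough to check by hand. Here is a small sketch of it, using the Llama 3.1 8B figures quoted above; the function names are illustrative. The table's numbers match when K, M and G are read as powers of 1,024, which is the assumption made here.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token: a key and a value in every layer."""
    return layers * kv_heads * head_dim * 2 * bytes_per_value


def kv_cache_gib(bytes_per_token, tokens, sequences=1):
    """Total cache size in GiB for a context length and number of concurrent sequences."""
    return bytes_per_token * tokens * sequences / 2**30


# Llama 3.1 8B: 32 layers x 8 key/value heads x 128 numbers per head, bfloat16.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)

per_token_kib = per_token / 1024                       # 128.0
at_8k = kv_cache_gib(per_token, 8 * 1024)              # 1.0
at_128k = kv_cache_gib(per_token, 128 * 1024)          # 16.0
ten_users = kv_cache_gib(per_token, 128 * 1024, 10)    # 160.0

# An 8-bit cache stores one byte per value and halves the total.
per_token_int8 = kv_bytes_per_token(32, 8, 128, bytes_per_value=1)
```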

Read across a row and the explosion is obvious. A single 8B model at 128K tokens needs ~16 GB of cache, as much memory as the model itself, so a 24 GB consumer GPU is already out of room. At 1M tokens, even the 1B model needs 32 GB of cache, and every larger model here needs well over 100 GB, more than an 80 GB data-center GPU can hold. Read down a column and you see the other half: bigger models have more layers, so the cache grows with model size too. The numbers only ever go one way.

Why long context gets slow

Memory is not just a capacity problem — it is a speed problem, and this is the part people find least intuitive.

When the model generates each new token, it has to read the entire KV cache out of memory to do its attention — every key and value, in every layer. At 100,000 tokens that is ~15 GB of data that must be streamed from GPU memory for every single token produced.

A GPU can only move so many gigabytes per second (its memory bandwidth). When the cache is small, this is no problem; the model is busy doing math. But as the cache grows, the model spends less time computing and more time simply waiting for the cache to arrive from memory. Generation becomes memory-bandwidth-bound: the longer the context, the more cache to stream each step, the slower each token comes out.
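A back-of-the-envelope sketch makes the bandwidth limit concrete. It assumes a hypothetical GPU with a round 1 TB/s of memory bandwidth and counts only the time to stream the cache once per token; reading the weights and doing the math would add to this, so treat the result as a floor rather than a prediction.

```python
# Assumption: a hypothetical GPU with 1 TB/s of memory bandwidth (round figure).
ASSUMED_BANDWIDTH = 1e12  # bytes per second


def cache_read_seconds(cache_bytes, bandwidth=ASSUMED_BANDWIDTH):
    """Minimum time to stream the whole KV cache once, i.e. per generated token."""
    return cache_bytes / bandwidth


def max_tokens_per_second(cache_bytes, bandwidth=ASSUMED_BANDWIDTH):
    """Upper bound on generation speed from cache reads alone."""
    return 1.0 / cache_read_seconds(cache_bytes, bandwidth)


short_ctx = cache_read_seconds(1.5e9)   # ~1.5 GB cache at 10,000 tokens
long_ctx = cache_read_seconds(15e9)     # ~15 GB cache at 100,000 tokens
```

Under these assumptions the 100,000-token context spends about 15 ms per token just reading the cache, ten times longer than the 10,000-token context, which caps generation below roughly 67 tokens per second before any computation is counted.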

This is why a model that zips along on a short prompt slows to a crawl on a long one, even though it is doing the "same" work per token. The math tied to the model's weights is the same for every token; the memory traffic is not. It scales with how much you have already said.

The notes analogy holds right to the end: once your stack of notes is thick enough, you spend all your time flipping through it, and barely any time thinking.

And then you hit the wall

Slowness is the gentle failure mode. The hard one is running out of memory entirely.

The KV cache lives in the same finite GPU memory as the model's weights. Weights take a fixed chunk; the cache eats the rest, and it eats more with every token. At some context length, weights + cache simply exceed the memory the card has, and the model stops: not slowly, but with an out-of-memory error. On a 32 GB consumer GPU running a ~16 GB model, about 16 GB is left for cache; at roughly 128 KB per token that is on the order of 128K tokens, so a context of a few hundred thousand tokens will not fit at all.
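The ceiling can be estimated with one division. The sketch below is a simplification: it assumes the only things in GPU memory are the weights and the cache (real runtimes also need room for activations and framework overhead), and it uses a 128 KB-per-token cache as an example figure. The function name is illustrative.

```python
GIB = 2**30


def max_context_tokens(gpu_bytes, weight_bytes, bytes_per_token, sequences=1):
    """Largest context that fits once the weights are loaded (ignores other overhead)."""
    free = gpu_bytes - weight_bytes
    if free <= 0:
        return 0  # the weights alone do not fit
    return free // (bytes_per_token * sequences)


per_token = 128 * 1024  # example: 128 KB of cache per token

one_user = max_context_tokens(32 * GIB, 16 * GIB, per_token)                 # 131072
four_users = max_context_tokens(32 * GIB, 16 * GIB, per_token, sequences=4)  # 32768
small_card = max_context_tokens(24 * GIB, 16 * GIB, per_token)               # 65536
```

In this idealized case the 32 GB card tops out at about 131,000 tokens for a single conversation, and the limit shrinks in proportion to the number of conversations sharing the card.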

So there is a hard ceiling on how much context you can hold, set not by the model's intelligence but by the arithmetic of the cache.

The tension, in one line

The KV cache is both the cure and the disease:

It exists to avoid recomputing the past — but storing the past costs memory that grows with every token, and reading the past costs bandwidth that grows with every token. The very thing that makes generation fast is the thing that makes long context slow and, eventually, impossible.

Short prompts hide this completely. It is only when you reach for long context — whole documents, long codebases, hours of conversation — that the cache stops being a clever footnote and becomes the central bottleneck of the entire system.

That bottleneck — its memory growth, its bandwidth cost, its hard ceiling — is the problem that a great deal of modern LLM-systems research is trying to get around. But understanding the problem clearly comes first, and the problem is simply this: the notes pile up.
