The chef's prep analogy
A chef starts a busy night by prepping: dicing onions, mincing garlic, simmering stock. That prep is slow, but once it is done, every dish that night reuses it and only needs quick plating on top.
In an LLM, the prefill is the prep: chew through the whole prompt and store the K and V vectors for every token. The decode is the plating: each new token just uses what is in the fridge. Without the fridge (the KV cache), every dish would re-prep from scratch.
Why we cache K and V (not Q)
In every attention layer, each token has three projections: Query, Key, Value.
- Q — only the current token's query is needed at each step. Don't cache.
- K, V — every previous token's K and V are needed every time a new token attends backward. Cache aggressively.
When generating token 100, instead of recomputing K and V for tokens 1–99, you reuse the cached K/V for those 99 and only compute a fresh Q, K, and V for token 100. That is the entire optimisation.
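To make the reuse concrete, here is a minimal single-head, single-layer sketch in NumPy. The dimension `d`, the random weights, and the `KVCache` class are illustrative assumptions, not any framework's API. It checks that attending with cached K/V gives the same output for the last token as recomputing everything from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # toy head dimension (assumed)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend_full(X):
    """No cache: recompute Q, K, V for every token, return the last token's output."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q[-1] @ K.T / np.sqrt(d)     # last token attends to all tokens so far
    return softmax(scores) @ V

class KVCache:
    """Hypothetical helper: keeps one row of K and V per token seen so far."""
    def __init__(self):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, x):
        q, k, v = x @ Wq, x @ Wk, x @ Wv  # fresh work for the new token only
        self.K = np.vstack([self.K, k])   # append one row of K
        self.V = np.vstack([self.V, v])   # append one row of V
        scores = q @ self.K.T / np.sqrt(d)
        return softmax(scores) @ self.V

X = rng.standard_normal((100, d))         # 100 toy token embeddings
cache = KVCache()
for x in X:
    out_cached = cache.step(x)            # after the loop: output for token 100
out_full = attend_full(X)
print(np.allclose(out_cached, out_full))
```

The query is used once and thrown away, while `cache.K` and `cache.V` grow by one row per token, which is why only K and V are worth storing.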
Two phases of every generation
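- Prefill — process the whole prompt in one parallel pass, computing attention over every prompt token and filling the cache. Compute-bound. Slow in wall-clock terms because it is a lot of work at once, though efficient per token.
- Decode — generate one token at a time. Each step adds one row of K and V per layer to the cache. Memory-bandwidth-bound. Each step is much quicker than the prefill pass, but far less efficient per token.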
The first token is slow (prefill). Tokens 2…N are fast (decode). This is why streaming feels "first-byte slow, rest fast."
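One rough way to see why the two phases hit different limits is to count FLOPs per byte of weights read for a single square matmul. The function below is a back-of-the-envelope sketch: the name `arithmetic_intensity`, the 2048-token prompt, and the 8192 width are assumptions for illustration, and activation traffic is ignored.

```python
def arithmetic_intensity(n_tokens, d_model, bytes_per_weight=2):
    """FLOPs per byte of weights read for one d_model x d_model matmul."""
    flops = 2 * n_tokens * d_model * d_model          # one multiply and one add per weight per token
    weight_bytes = bytes_per_weight * d_model * d_model
    return flops / weight_bytes

prefill = arithmetic_intensity(n_tokens=2048, d_model=8192)  # whole prompt at once
decode = arithmetic_intensity(n_tokens=1, d_model=8192)      # one token per step
print(prefill, decode)
```

With FP16 weights, prefill does about 2048 FLOPs for each byte it reads, while a decode step does about 1. The same weights are read either way, so decode spends most of its time waiting on memory.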
How big is the cache?
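Per token, per layer, you store K and V: two vectors of size hidden_size with standard multi-head attention (smaller with MQA/GQA, covered below). For a 70B-class model with 80 layers and a 16k context, assuming hidden_size 8192, that is roughly 10–40 GB per sequence depending on dtype (INT4 up to FP16). Multiply by the number of concurrent requests and the cache can rival or exceed the model weights in serving memory.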
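The arithmetic behind that range can be sketched as follows. The model shape (80 layers, 64 heads of dimension 128, so hidden_size 8192, with no K/V sharing between heads) is an assumption chosen to resemble a 70B-class model, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem):
    """Cache size for one sequence; the factor 2 is one K and one V per token per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

GIB = 1024 ** 3
CONTEXT = 16 * 1024   # 16k tokens

# Assumed shape: 80 layers, 64 KV heads x 128 dims = hidden_size 8192
for name, nbytes in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    size = kv_cache_bytes(80, 64, 128, CONTEXT, nbytes)
    print(f"{name}: {size / GIB:.0f} GiB")
```

Under these assumptions the result is 40 GiB in FP16, 20 GiB in INT8, and 10 GiB in INT4, for a single 16k-token sequence.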
Why this is the bottleneck
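On modern GPUs, decode is limited by memory bandwidth rather than compute. Every generated token requires streaming the model weights plus the whole cache (around 30 GB in an example like the one above) from VRAM into the compute units, and that traffic is the dominant cost. Tricks to fight this: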
- PagedAttention (vLLM) — store the cache in fixed-size pages so memory is not fragmented across many concurrent requests.
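- Continuous batching — schedule at the level of individual decode steps: finished requests leave the batch and new ones join immediately, so nobody waits for the longest request in the batch to finish.
- Multi-query attention (MQA) and grouped-query attention (GQA) — share K and V across multiple query heads. GQA typically shrinks the cache by 4–8× at small quality cost; MQA keeps a single K/V head and shrinks it by the full number of query heads. Many recent LLMs use GQA.
- Quantize the cache — store K/V in INT8 or INT4 instead of FP16, cutting cache memory by 2× or 4× at some accuracy cost.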
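As an illustration of the paging idea from the list above, here is a toy block table. It is a simplified sketch, not vLLM's actual implementation: the class `PagedKVAllocator`, its methods, and the page counts are all hypothetical, and it only tracks page ids rather than real K/V tensors.

```python
class PagedKVAllocator:
    """Toy block table: maps each request to a list of fixed-size pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}   # request id -> list of page ids
        self.lengths = {}        # request id -> number of tokens stored

    def append_token(self, req_id):
        """Reserve room for one more token; return (page id, slot within page)."""
        table = self.block_tables.setdefault(req_id, [])
        n = self.lengths.get(req_id, 0)
        if n % self.page_size == 0:           # no page yet, or the last page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[req_id] = n + 1
        return table[-1], n % self.page_size

    def release(self, req_id):
        """Return a finished request's pages to the free pool."""
        self.free_pages.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

alloc = PagedKVAllocator(num_pages=8, page_size=4)
for _ in range(6):
    alloc.append_token("a")                   # 6 tokens -> 2 pages
for _ in range(3):
    alloc.append_token("b")                   # 3 tokens -> 1 page
print(len(alloc.block_tables["a"]), len(alloc.block_tables["b"]), len(alloc.free_pages))
alloc.release("a")
print(len(alloc.free_pages))
```

Because pages are handed out on demand, a request wastes at most one partly filled page, and pages freed by a finished request can be reused by any other request regardless of where they sit in memory.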
Why long contexts are expensive
Cache size scales linearly with context length, so doubling context doubles cache memory. Per-token decode latency grows with it, since there is more cache to read each step, and approaches doubling once cache reads outweigh the fixed cost of reading the weights. This is why "1M-token context" needs serious engineering, not just bigger numbers in the config.
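A rough lower bound on decode latency follows from dividing the bytes read per step by memory bandwidth. The sketch below uses assumed numbers throughout (140 GB of FP16 weights for a 70B model, 2.5 GB of cache per 1k tokens, 2 TB/s of aggregate bandwidth across the serving GPUs), and real systems add compute and communication overheads on top.

```python
def decode_step_ms(weight_gb, cache_gb, bandwidth_gb_per_s):
    """Lower bound on one decode step: time to stream weights and cache once."""
    return (weight_gb + cache_gb) * 1000 / bandwidth_gb_per_s

WEIGHT_GB = 140.0             # assumed: 70B parameters in FP16
CACHE_GB_PER_1K = 2.5         # assumed: cache growth per 1k tokens of context
BANDWIDTH_GB_PER_S = 2000.0   # assumed: aggregate memory bandwidth

for ctx_k in (16, 32, 64, 128):
    cache_gb = CACHE_GB_PER_1K * ctx_k
    ms = decode_step_ms(WEIGHT_GB, cache_gb, BANDWIDTH_GB_PER_S)
    print(f"{ctx_k:>4}k context: cache {cache_gb:6.1f} GB, at least {ms:6.1f} ms per token")
```

With these numbers the bound rises from 90 ms at 16k to 110 ms at 32k: the cache term doubles while the weight term stays fixed, so the total only approaches a true doubling at longer contexts where the cache dominates.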
Practical implications
- Prefill cost ≠ decode cost. Long prompts are expensive once. Long outputs are expensive every token.
- Cache reuse across calls is huge for chat, where every turn replays the same system prompt and history. Frameworks like vLLM and TRT-LLM support prefix caching, which lets a new request skip prefill for a prefix that is already cached.
- Watch your cache budget, not just your weights. A small model with a giant context can use more memory than a big model with a small context.
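The prefix-caching idea from the list above can be sketched with a toy lookup table. This is an illustration only, not how vLLM or TRT-LLM implement it: `BLOCK`, `block_keys`, and `PrefixCache` are hypothetical names, and a placeholder string stands in for the real K/V tensors.

```python
import hashlib

BLOCK = 4  # tokens per cached block (assumed)

def block_keys(tokens):
    """Key each full block by a hash of every token up to and including it."""
    keys, h = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK      # ignore a trailing partial block
    for i in range(0, full, BLOCK):
        h.update(repr(tokens[i:i + BLOCK]).encode())
        keys.append(h.hexdigest())                # running hash covers the whole prefix
    return keys

class PrefixCache:
    def __init__(self):
        self.blocks = {}   # key -> stand-in for that block's K/V tensors

    def match_and_store(self, tokens):
        """Return how many leading tokens were already cached, then store all blocks."""
        keys = block_keys(tokens)
        hits = 0
        for key in keys:
            if key not in self.blocks:
                break
            hits += 1
        for key in keys:
            self.blocks.setdefault(key, "kv-placeholder")
        return hits * BLOCK

prefix_cache = PrefixCache()
system_prompt = list(range(12))                   # 12 shared token ids
first = prefix_cache.match_and_store(system_prompt + [100, 101, 102, 103])
second = prefix_cache.match_and_store(system_prompt + [200, 201, 202, 203])
print(first, second)
```

The first call finds nothing cached; the second reuses the 12 system-prompt tokens and would only need prefill for the 4 new ones. Hashing the running prefix rather than each block alone ensures a block only matches when everything before it matches too.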