KV Cache: Why the Second Token Is Faster Than the First

Without a KV cache, every new token re-computes attention over the whole sequence. With it, you reuse all previous work. This is most of LLM serving.

Apr 29, 2026 · 3 min read

The chef's prep analogy

A chef starts a busy night by prepping: dicing onions, mincing garlic, simmering stock. That prep is slow. But once it is done, every dish that night reuses it, so plating is fast.

In an LLM, the prefill is the prep: chew through the whole prompt and store the K and V vectors for every token. The decode is the plating: each new token just uses what is in the fridge. Without the fridge (the KV cache), every dish would re-prep from scratch.

Why we cache K and V (not Q)

In every attention layer, each token has three projections: Query, Key, Value.

Q — only the current token's query is needed at each step. Don't cache.
K, V — every previous token's K and V are needed every time a new token attends backward. Cache aggressively.

When generating token 100, instead of recomputing Q/K/V for tokens 1–99, you reuse the cached K/V for those 99 and only do fresh work for token 100. That is the entire optimisation.
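
The reuse described above can be sketched in a few lines of plain Python. This is a toy, single-head illustration under stated assumptions, not any framework's implementation: the helper names (`project`, `attend`, `decode_with_cache`) are made up for this example, the inputs are random vectors rather than real embeddings, and every input is known up front, whereas a real model feeds each sampled token back in. It checks that both paths give the same outputs and counts how many K/V projections each one performs.

```python
import math
import random

def project(x, W):
    """Vector-matrix product: x (length d_in) times W (d_in rows, d_out columns)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*W)]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys and values so far."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)                                  # subtract max for a stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[j] for e, v in zip(exps, values)) for j in range(dim)]

def decode_without_cache(xs, Wq, Wk, Wv):
    """Every step re-projects K and V for the whole prefix."""
    outputs, kv_projections = [], 0
    for t in range(1, len(xs) + 1):
        keys = [project(x, Wk) for x in xs[:t]]
        values = [project(x, Wv) for x in xs[:t]]
        kv_projections += t
        outputs.append(attend(project(xs[t - 1], Wq), keys, values))
    return outputs, kv_projections

def decode_with_cache(xs, Wq, Wk, Wv):
    """Every step projects K and V for the new token only and appends them to the cache."""
    k_cache, v_cache, outputs, kv_projections = [], [], [], 0
    for x in xs:
        k_cache.append(project(x, Wk))
        v_cache.append(project(x, Wv))
        kv_projections += 1
        outputs.append(attend(project(x, Wq), k_cache, v_cache))
    return outputs, kv_projections

random.seed(0)
d = 4
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]
Wq, Wk, Wv = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(d)] for _ in range(3))

slow_out, slow_count = decode_without_cache(xs, Wq, Wk, Wv)
fast_out, fast_count = decode_with_cache(xs, Wq, Wk, Wv)
outputs_match = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_a, row_b in zip(slow_out, fast_out)
    for a, b in zip(row_a, row_b)
)
print(outputs_match, slow_count, fast_count)
```

For N tokens the uncached path performs 1 + 2 + … + N projections of K and V, while the cached path performs N. Q never enters the cache because only the newest token's query is used at each step.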

Two phases of every generation

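Prefill: process the whole prompt in one parallel pass and fill the cache. Compute-bound. Its duration sets the time to first token, and it grows with prompt length.
Decode: generate one token at a time. Each step adds one row of K and V per layer to the cache. Memory-bound. Each step is much quicker than the prefill pass, although the GPU does less useful work per byte it reads.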

The first token is slow (prefill). Tokens 2…N are fast (decode). This is why streaming feels "first-byte slow, rest fast."

How big is the cache?

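Per token, per layer, you store K and V: two vectors of size hidden_size with standard multi-head attention (or num_kv_heads × head_dim when K and V heads are shared, as in GQA). For a 70B model with 80 layers and a 16k context, the cache is ~10–40 GB per sequence depending on dtype. Multiply that by the number of concurrent sequences and the cache can dwarf the model weights in serving memory.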
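
The arithmetic behind that range can be sketched as follows. The layer count and context length come from the text. The head layout (64 heads of dimension 128, so hidden_size = 8192, and 8 K/V heads for the GQA variant) is an assumption chosen for illustration, not the specification of any particular model.

```python
GIB = 1024 ** 3

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem, batch_size=1):
    """K and V: 2 vectors of num_kv_heads * head_dim elements per token, per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size

LAYERS, CONTEXT = 80, 16 * 1024        # from the text: 80 layers, 16k context
HEADS, KV_HEADS, HEAD_DIM = 64, 8, 128  # assumed layout, illustration only

mha_fp16 = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, CONTEXT, bytes_per_elem=2)
mha_int8 = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, CONTEXT, bytes_per_elem=1)
mha_int4 = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, CONTEXT, bytes_per_elem=0.5)
gqa_fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bytes_per_elem=2)

for name, size in [("MHA fp16", mha_fp16), ("MHA int8", mha_int8),
                   ("MHA int4", mha_int4), ("GQA fp16", gqa_fp16)]:
    print(f"{name}: {size / GIB:.1f} GiB per sequence")
```

Under these assumptions, full multi-head attention lands at 40, 20 and 10 GiB for FP16, INT8 and INT4, which matches the range quoted above, and sharing K/V across 8 heads brings the FP16 figure down to 5 GiB. All figures are per sequence; pass `batch_size` to account for concurrent requests.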

Why this is the bottleneck

Decode on modern GPUs is limited by memory bandwidth, not arithmetic. Streaming a 30 GB cache (on top of the model weights) from VRAM into the compute units for every single token is the dominant cost. Tricks to fight this:

PagedAttention (vLLM): store the cache in fixed-size pages (blocks) so memory is not fragmented across many concurrent requests.
Continuous batching: schedule at the level of individual decode steps, so finished requests leave the batch and new ones join immediately instead of everyone waiting for the longest one.
Multi-query attention (MQA) and grouped-query attention (GQA): share K and V across multiple query heads. GQA typically shrinks the cache by 4–8× at a small quality cost, and MQA, which keeps a single K/V head, shrinks it further. Many modern LLMs use GQA.
Quantize the cache: store K/V in INT8 or INT4, which cuts the cache to a half or a quarter of its FP16 size at some cost in accuracy.
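
To make the paging idea from the list above concrete, here is a toy block allocator. It is a sketch of the bookkeeping only, under the assumption that each sequence owns a block table mapping logical positions to physical blocks. The class and method names are hypothetical, and this is not vLLM's actual implementation or API.

```python
class PagedKVAllocator:
    """Toy bookkeeping for a KV cache split into fixed-size blocks (hypothetical API)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset_in_block)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:   # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

allocator = PagedKVAllocator(num_blocks=4, block_size=2)
slots_a = [allocator.append_token("a") for _ in range(3)]   # spills into a second block
slot_b = allocator.append_token("b")
free_while_running = len(allocator.free_blocks)
allocator.free("a")
free_after_a_finishes = len(allocator.free_blocks)
print(slots_a, slot_b, free_while_running, free_after_a_finishes)
```

Because a sequence only ever claims one block at a time, the most memory it can waste is the unused tail of its last block, and a finished request's blocks can be handed to any other request without moving data.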

Why long contexts are expensive

Cache size scales linearly with context length. Doubling context doubles cache memory and pushes up per-token decode latency too, because there is more cache to read each step; the increase approaches 2× once cache reads outweigh weight reads. This is why "1M-token context" needs serious engineering, not just bigger numbers in the config.
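
A rough way to see how context length feeds into decode latency is a bandwidth-only estimate: assume each decode step reads all weights and the whole cache once, and ignore compute, batching and kernel overlap. The three constants below are hypothetical round numbers picked for illustration, not measurements of any GPU or model.

```python
def decode_step_seconds(weight_bytes, cache_bytes_per_token, context_len, bandwidth):
    """Bandwidth-only estimate: weights and cache are each read once per decode step."""
    cache_bytes = cache_bytes_per_token * context_len
    return (weight_bytes + cache_bytes) / bandwidth

# Hypothetical figures, chosen only to make the shape of the curve visible.
WEIGHT_BYTES = 16e9                  # model weights
CACHE_BYTES_PER_TOKEN = 128 * 1024   # K and V for one token, all layers
BANDWIDTH = 2e12                     # bytes per second of memory bandwidth

def doubling_ratio(context_len):
    """How much slower one decode step gets when the context doubles."""
    base = decode_step_seconds(WEIGHT_BYTES, CACHE_BYTES_PER_TOKEN, context_len, BANDWIDTH)
    doubled = decode_step_seconds(WEIGHT_BYTES, CACHE_BYTES_PER_TOKEN, 2 * context_len, BANDWIDTH)
    return doubled / base

for context_len in (32_768, 262_144, 1_048_576):
    print(f"{context_len:>9} tokens: doubling context costs {doubling_ratio(context_len):.2f}x per step")
```

In this simplified model, the penalty for doubling is small while weight reads dominate and climbs toward 2× as the cache takes over the memory traffic, which is the regime very long contexts live in.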

Practical implications

Prefill cost ≠ decode cost. Long prompts are expensive once. Long outputs are expensive every token.
Cache reuse across calls is huge for chat, where every turn resends the same system prompt and history. Frameworks like vLLM and TensorRT-LLM (TRT-LLM) support prefix caching.
Watch your cache budget, not just your weights. A small model with a giant context can use more memory than a big model with a small context.
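
Prefix caching, mentioned in the list above, can be illustrated with a toy lookup table. This sketch assumes the cache is split into fixed-size blocks and that a block is reusable only if every token before it matches. The `PrefixCache` class and its handle strings are hypothetical and do not reflect how vLLM or TensorRT-LLM implement the feature internally.

```python
class PrefixCache:
    """Toy prefix cache: maps block-aligned token prefixes to stored K/V handles."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = {}   # tuple of prefix tokens -> opaque handle (hypothetical)

    def _block_ends(self, tokens):
        return range(self.block_size, len(tokens) + 1, self.block_size)

    def insert(self, tokens):
        """Record every full block of this prompt as cached."""
        for end in self._block_ends(tokens):
            self.blocks.setdefault(tuple(tokens[:end]), f"kv-block-{len(self.blocks)}")

    def match(self, tokens):
        """Return how many leading tokens already have cached K/V."""
        matched = 0
        for end in self._block_ends(tokens):
            if tuple(tokens[:end]) not in self.blocks:
                break
            matched = end
        return matched

cache = PrefixCache(block_size=4)
system_prompt = list(range(8))                  # stand-in token ids
first_call = system_prompt + [100, 101, 102]
second_call = system_prompt + [200, 201, 202, 203, 204]

cache.insert(first_call)
reused = cache.match(second_call)
to_prefill = len(second_call) - reused
unrelated = cache.match([9, 9, 9, 9, 9])
print(reused, to_prefill, unrelated)
```

The second call only has to prefill the tokens after the shared system prompt. Keying on the full prefix, rather than on a block's own tokens, matters because K and V for a position depend on everything before it.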

From the field

The KV cache is the invisible thing that decides how many concurrent users a GPU can hold — it's memory, not compute, that usually runs out first. The builder's takeaway: anything that lets you reuse a cached prefix is gold, which is exactly why provider-side prompt caching saves so much. When I self-host, the knob that moved throughput most wasn't a faster model, it was fitting more sequences in memory by capping max context and evicting idle sessions. If your serving cost scales with concurrent chats more than with total tokens, the KV cache is why — and where you go looking for savings.
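
The effect of capping max context on concurrency is simple arithmetic. The sketch below assumes a hypothetical 80 GiB GPU, 16 GiB of weights and 128 KiB of cache per token, and it ignores activations and allocator overhead, so treat the results as upper bounds rather than a sizing guide.

```python
GIB = 1024 ** 3
KIB = 1024

def max_concurrent_sequences(vram_bytes, weight_bytes, cache_bytes_per_token, max_context):
    """Upper bound on sequences that fit if each one reserves a full max_context of cache."""
    budget = vram_bytes - weight_bytes
    return budget // (cache_bytes_per_token * max_context)

# Hypothetical deployment figures, for illustration only.
VRAM = 80 * GIB
WEIGHTS = 16 * GIB
CACHE_PER_TOKEN = 128 * KIB

at_32k = max_concurrent_sequences(VRAM, WEIGHTS, CACHE_PER_TOKEN, 32 * 1024)
at_8k = max_concurrent_sequences(VRAM, WEIGHTS, CACHE_PER_TOKEN, 8 * 1024)
print(at_32k, at_8k)
```

With these numbers, cutting the context cap from 32k to 8k raises the ceiling from 16 to 64 concurrent sequences on the same hardware, without touching the model.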
