KV Cache: Why the Second Token Is Faster Than the First

Without a KV cache, every new token recomputes the keys and values for the whole sequence so far. With it, each step reuses the work already done for earlier tokens. Managing that cache is a large part of LLM serving.

The chef's prep analogy

A chef starts a busy night by prepping: dicing onions, mincing garlic, simmering stock. That prep is slow. But once it is done, every dish that night reuses it, so plating is fast.

In an LLM, the prefill is the prep: chew through the whole prompt and store the K and V vectors for every token. The decode is the plating: each new token just uses what is in the fridge. Without the fridge (the KV cache), every dish would re-prep from scratch.

Why we cache K and V (not Q)

In every attention layer, each token has three projections: Query, Key, Value.

  • Q — only the current token's query is needed at each step. Don't cache.
  • K, V — every previous token's K and V are needed every time a new token attends backward. Cache aggressively.

When generating token 100, instead of recomputing Q/K/V for tokens 1–99, you reuse the cached K/V for those 99 and only do fresh work for token 100. That is the entire optimisation.
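As a rough illustration of that reuse, here is a minimal single-head sketch in plain Python. The dimension `D`, the random weight matrices, and the `KVCache` class are all made up for the example; a real model has many heads and layers. It checks that attending from token 100 over cached K/V gives the same result as recomputing K and V for all 100 tokens.

```python
import math
import random

random.seed(0)
D = 4  # toy head dimension (assumption for this sketch)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical projection weights for one attention head.
W_Q, W_K, W_V = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)

def project(x, w):
    """Row vector x (length D) times matrix w (D x D)."""
    return [sum(x[i] * w[i][j] for i in range(D)) for j in range(D)]

def attend(q, keys, values):
    """Scaled dot-product attention for a single query over all keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(D)]

def last_output_no_cache(tokens):
    """Recompute K and V for every token, then attend from the last one."""
    keys = [project(t, W_K) for t in tokens]
    values = [project(t, W_V) for t in tokens]
    return attend(project(tokens[-1], W_Q), keys, values)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token):
        """Project only the new token, append its K and V, attend over the cache."""
        self.keys.append(project(token, W_K))
        self.values.append(project(token, W_V))
        return attend(project(token, W_Q), self.keys, self.values)

tokens = rand_matrix(100, D)  # stand-in embeddings for tokens 1..100
cache = KVCache()
for tok in tokens:
    cached_out = cache.step(tok)  # one projection of K and V per step

full_out = last_output_no_cache(tokens)  # 100 projections of K and V for one output
same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
           for a, b in zip(cached_out, full_out))
print(same)
```

The query for token 100 is computed fresh in both paths and thrown away afterwards, which is why only K and V end up in the cache.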

Two phases of every generation

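  1. Prefill: process the whole prompt in one batch, computing attention over every prompt token. Compute-bound. Slow, because it is one large pass.
  2. Decode: generate one token at a time. Each step adds one row of K and V to the cache. Memory-bound. Each step is far quicker than the prefill pass, even though it uses the GPU's compute less efficiently.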

The first token is slow (prefill). Tokens 2…N are fast (decode). This is why streaming feels "first-byte slow, rest fast."
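One way to picture the two phases is to list what each forward pass has to do. The sketch below is only bookkeeping, with a hypothetical `generation_schedule` helper: it counts how many positions each pass computes from scratch and how many cached rows it attends over, and it does not model real timings.

```python
def generation_schedule(prompt_len, new_tokens):
    """Return (phase, positions_computed, cache_rows_attended) for each forward pass."""
    # Prefill: all prompt positions in one pass; its output yields the first new token.
    passes = [("prefill", prompt_len, prompt_len)]
    # Decode: each later token needs one pass over a single new position,
    # attending over a cache that has grown by one row per step.
    for i in range(1, new_tokens):
        passes.append(("decode", 1, prompt_len + i))
    return passes

schedule = generation_schedule(prompt_len=1000, new_tokens=4)
for phase, computed, attended in schedule:
    print(f"{phase:8s} computes {computed:5d} position(s), attends over {attended} cached rows")
```

The prefill pass computes 1000 positions at once, while each decode pass computes one position but still has to read every cached row, which is where the memory-bound behaviour comes from.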

How big is the cache?

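Per token, per layer, you store K and V: two vectors of size hidden_size each (less when K/V heads are shared, as with GQA). For a 70B model with 80 layers and a 16k context, the cache for a single sequence is ~10–40 GB depending on dtype, assuming full-width K and V. Multiply that by the number of concurrent requests and the cache can rival or exceed the model weights in serving memory.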
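The arithmetic behind that range can be sketched in a few lines. The layer count and context length come from the text; the hidden size of 8192 is an assumption chosen for the example, and `kv_cache_bytes` is a hypothetical helper, not a library function.

```python
def kv_cache_bytes(layers, kv_width, tokens, bytes_per_value):
    """Bytes for one sequence: K and V (the factor 2) per layer, per token."""
    return 2 * layers * kv_width * tokens * bytes_per_value

GIB = 2**30
LAYERS, HIDDEN, CONTEXT = 80, 8192, 16 * 1024  # HIDDEN is an assumed value

for name, nbytes in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    size = kv_cache_bytes(LAYERS, HIDDEN, CONTEXT, nbytes)
    print(f"{name}: {size / GIB:.0f} GiB")
# FP16: 40 GiB
# INT8: 20 GiB
# INT4: 10 GiB
```

These figures are for one sequence with full-width K and V; sharing K/V heads divides `kv_width`, and serving several requests at once multiplies the total.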

Why this is the bottleneck

Modern GPUs are bandwidth-limited at decode: each step does little arithmetic but must read the whole cache. Reading, say, 30 GB of cache from VRAM into compute units for every single token is often the dominant cost. Tricks to fight this:

  • PagedAttention (vLLM): store the cache in fixed-size pages, allocated on demand, so memory is not fragmented across many concurrent requests.
  • Continuous batching: schedule at the level of individual decode steps, so finished requests leave the batch and new ones join immediately, without waiting for the longest one.
  • Multi-query attention (MQA) and grouped-query attention (GQA): share K and V across multiple query heads. GQA typically shrinks the cache by 4–8× at small quality cost; MQA, which keeps a single K/V head, shrinks it further. Most modern LLMs use GQA.
  • Quantize the cache: store K/V in INT8 or INT4, which halves or quarters the cache relative to FP16.
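To make the head-sharing idea concrete, here is a small sketch of how query heads map onto shared K/V heads and what that does to the per-token cache. The head counts (64 query heads, 8 K/V heads, head dimension 128) are assumed example shapes, and both helper functions are hypothetical.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the shared K/V head that a given query head reads from."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

def cache_values_per_token(num_kv_heads, head_dim):
    """Values stored per token, per layer: one K and one V vector per K/V head."""
    return 2 * num_kv_heads * head_dim

NUM_Q_HEADS, HEAD_DIM = 64, 128  # assumed example shapes

mha = cache_values_per_token(NUM_Q_HEADS, HEAD_DIM)  # every query head has its own K/V
gqa = cache_values_per_token(8, HEAD_DIM)            # 8 query heads share each K/V head
mqa = cache_values_per_token(1, HEAD_DIM)            # all query heads share one K/V head

print(mha // gqa, mha // mqa)  # 8 64
print([kv_head_for(h, NUM_Q_HEADS, 8) for h in (0, 7, 8, 63)])  # [0, 0, 1, 7]
```

Only the cached K and V shrink; each query head still computes its own Q, so the attention pattern per head stays distinct.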

Why long contexts are expensive

Cache size scales linearly with context length. Doubling context doubles cache memory and, once cache reads dominate the step, roughly doubles per-token decode latency too (more cache to read each step). This is why "1M-token context" needs serious engineering, not just bigger numbers in the config.
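A back-of-the-envelope estimate shows the linear scaling. The sketch below assumes a decode step is limited purely by reading the cache, with a memory bandwidth of 2 TB/s and the same illustrative model shapes as an 80-layer model with width 8192 in FP16. All of these numbers are assumptions, and the estimate ignores reading the model weights.

```python
def min_decode_ms(bytes_read_per_step, bandwidth_bytes_per_s):
    """Lower bound on one decode step if it is purely memory-bandwidth-bound."""
    return 1000 * bytes_read_per_step / bandwidth_bytes_per_s

BANDWIDTH = 2e12                     # assumed 2 TB/s of memory bandwidth
BYTES_PER_TOKEN = 2 * 80 * 8192 * 2  # K+V, 80 layers, width 8192, FP16 (assumed shapes)

for context in (16_384, 32_768, 65_536):
    cache_bytes = BYTES_PER_TOKEN * context
    print(context, round(min_decode_ms(cache_bytes, BANDWIDTH), 1))
# 16384 21.5
# 32768 42.9
# 65536 85.9
```

Under these assumptions, each doubling of context doubles the floor on per-token latency for a single sequence, before any compute is counted.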

Practical implications

  • Prefill cost ≠ decode cost. Long prompts are expensive once. Long outputs are expensive every token.
  • Cache reuse across calls is huge for chat, where every call replays the same system prompt + history. Frameworks like vLLM and TensorRT-LLM support prefix caching.
  • Watch your cache budget, not just your weights. A small model with a giant context can use more memory than a big model with a small context.
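The cache-reuse point can be illustrated with a toy calculation. This is not how vLLM or TensorRT-LLM implement prefix caching; it is a simplified sketch with made-up token ids and a hypothetical `shared_prefix_len` helper, showing how much of a new prompt could skip prefill if the K/V for a previous call were still in memory.

```python
def shared_prefix_len(cached_tokens, new_tokens):
    """Number of leading token ids the new prompt shares with a cached one."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Hypothetical token ids: a system prompt plus the first user turn.
previous_call = [101, 7, 7, 42, 9, 13]
# The next call replays the same tokens and appends a new turn.
next_call = previous_call + [55, 8, 21]

reusable = shared_prefix_len(previous_call, next_call)
to_prefill = len(next_call) - reusable
print(reusable, to_prefill)  # 6 3
```

Only the three new tokens need a prefill pass here; the K and V rows for the shared prefix would be read straight from the existing cache.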
Engr Mejba Ahmed
