The classroom analogy
You are a student in a packed lecture, taking notes on a single new concept. To understand it, your eyes flick to earlier slides — but only the slides that matter. The slide titled "Definition of derivative" gets a long stare; the one with the lecturer's vacation photos gets ignored.
Attention is that flick. For every token a model is processing, it looks back over all the other tokens and gives each one a weight: this one is highly relevant (long stare), that one is barely relevant (glance). The weighted mix becomes the new representation of the current token.
The Q, K, V trio
Attention is built from three projections of every token's vector:
- Query (Q) — "what am I looking for?"
- Key (K) — "what do I have to offer?"
- Value (V) — "what payload do I deliver if matched?"
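The score for token i attending to token j is the dot product Q_i · K_j, divided by √d_k (the square root of the key dimension) so the scores stay in a range where softmax behaves well. Apply softmax to each token's scores so they sum to 1, then mix the V vectors by those weights. That mix is the new token representation.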
Three projections, two more matrix multiplies (the scores and the weighted mix), plus a softmax. That is it.
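As a rough illustration of the recipe above, here is a minimal NumPy sketch of single-head attention. The function name attention, the weight matrices W_q, W_k, W_v, and the toy sizes are assumptions made for this example, not taken from any particular model. The sketch also divides the scores by the square root of the key dimension, the usual scaling in scaled dot-product attention, and it uses random weights where a real model would use learned ones.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Single-head attention over X with shape (n_tokens, d_model)."""
    Q = X @ W_q                                # "what am I looking for?"
    K = X @ W_k                                # "what do I have to offer?"
    V = X @ W_v                                # "what payload do I deliver?"
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights                # weighted mix of the V vectors

# Toy inputs: random stand-ins for token vectors and learned projections.
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = attention(X, W_q, W_k, W_v)
print(out.shape)       # one new vector per token
print(weights.shape)   # one weight for every (token i, token j) pair
```

Row i of weights is the set of "stares and glances" that token i gives to every token in the sequence.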
Why "multi-head"
Doing attention once forces all the relationships through one lens. Multi-head attention runs the computation several times in parallel, each with its own learned Q/K/V projections, then concatenates the head outputs and passes them through a final learned output projection. One head might learn syntactic agreement, another long-range coreference, another local punctuation patterns.
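Modern LLMs typically use somewhere between 16 and 128 heads. Because the model width is split across the heads, head count and per-head dimension trade off against each other; many models keep the per-head dimension at roughly 64 to 128 and add heads as the model gets wider.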
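The multi-head idea can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the name multi_head_attention, the output projection W_o (the usual final step that mixes the concatenated heads), and the toy sizes are all choices made for this example. It also assumes the common convention of storing every head's projections as column slices of one d_model × d_model matrix, so d_model must be divisible by n_heads.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X has shape (n_tokens, d_model); every W has shape (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // n_heads               # per-head dimension

    def split_heads(M):
        # (n, d_model) -> (n_heads, n, d_head): one slice of columns per head
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Each head runs its own attention; the leading axis is the head index.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    weights = softmax(scores, axis=-1)
    per_head = weights @ V                                  # (n_heads, n, d_head)

    # Concatenate the heads back to (n, d_model), then mix them with W_o.
    concat = per_head.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 16, 4
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

Y = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(Y.shape)   # same shape as X: one updated vector per token
```

With n_heads set to 4 and d_model set to 16, each head works in a 4-dimensional subspace, which shows the trade-off between head count and per-head dimension directly.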
Causal masking
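In decoder-only LLMs, attention is causally masked: the token at position 5 can attend to positions 1–5 but not to position 6 or later. This is what makes them autoregressive: each position is trained to predict the next token, and the mask ensures it cannot peek at that token or any later one. It is implemented by setting forbidden scores to -∞ before the softmax, so those positions receive zero weight.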
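A small NumPy sketch of the masking step, assuming a square matrix of raw scores where row i holds the scores of token i against every token j. The function name causal_attention_weights is made up for this example, and positions are 0-indexed here, unlike the 1-indexed positions in the text.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to (n, n) raw scores, then softmax each row."""
    n = scores.shape[-1]
    # True strictly above the diagonal: column j > row i, i.e. a future token.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)             # exp(-inf) is 0: future tokens get no weight
    return e / e.sum(axis=-1, keepdims=True)

# Equal raw scores make the effect of the mask easy to see.
scores = np.zeros((4, 4))
weights = causal_attention_weights(scores)
print(weights)
```

With equal scores, the first token can only attend to itself, the second splits its weight evenly over two positions, and so on; everything above the diagonal is exactly zero.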
The cost shape
Attention compares every pair of tokens, so its cost scales as O(n²) in the sequence length n. Doubling the context quadruples attention compute. This is why long-context models invest heavily in:
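- FlashAttention: exactly the same math, computed in tiles that stay in fast on-chip memory so the full n × n score matrix is never written out. Compute is still quadratic, but memory traffic drops and real-world speed improves substantially.
- Sparse attention: each token attends to only a subset of positions (for example a sliding window plus a few global tokens).
- Linear attention: approximations, such as kernel feature maps or low-rank projections, whose cost grows linearly in n instead of quadratically.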
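Of these, only FlashAttention leaves the model untouched: the architecture stays "vanilla attention" and only the kernel underneath changes. Sparse and linear attention change what the model actually computes, so they are architectural choices rather than drop-in kernels.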
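To make the cost shape concrete, the following sketch counts how many (query, key) pairs get scored under full attention and under a causal sliding window. Both helper names and the window size of 256 are assumptions chosen for illustration; real sparse-attention schemes differ in their details.

```python
def full_attention_pairs(n):
    # Every token is compared with every token: n * n scores.
    return n * n

def causal_sliding_window_pairs(n, window):
    # Token i (0-indexed) attends to itself and up to window - 1 earlier tokens.
    return sum(min(i + 1, window) for i in range(n))

for n in (1024, 2048, 4096):
    full = full_attention_pairs(n)
    windowed = causal_sliding_window_pairs(n, window=256)
    print(f"n={n:5d}  full={full:>10,d}  sliding window={windowed:>10,d}")
```

Each doubling of n quadruples the full count, while the windowed count only roughly doubles, which is the whole appeal of restricting attention to a subset.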