Attention: How Models Decide What Matters

Attention is a soft lookup: every token asks every other token "are you relevant?" and weights the answer.

Apr 29, 2026 · 3 min read

The classroom analogy

You are a student in a packed lecture, taking notes on a single new concept. To understand it, your eyes flick to earlier slides — but only the slides that matter. The slide titled "Definition of derivative" gets a long stare; the one with the lecturer's vacation photos gets ignored.

Attention is that flick. For every token a model is processing, it looks back over all the other tokens and gives each one a weight: this one is highly relevant (long stare), that one is barely relevant (glance). The weighted mix becomes the new representation of the current token.

The Q, K, V trio

Attention is built from three projections of every token's vector:

Query (Q) — "what am I looking for?"
Key (K) — "what do I have to offer?"
Value (V) — "what payload do I deliver if matched?"

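The score for token i attending to token j is Q_i · K_j (a dot product), usually divided by √d_k, the square root of the key dimension, so that scores do not grow with vector size. Softmax each token's scores so they sum to 1, then mix the V vectors by those weights. That mix is the new token representation.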

Three projections, two more matrix multiplies (one for the scores, one for the weighted mix), and a softmax. That is it.
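The recipe above can be sketched in a few lines of NumPy. This is a minimal single-head sketch, not a production implementation: the toy sizes, the random weight matrices, and the names `attention`, `W_q`, `W_k`, `W_v` are assumptions for illustration, and the scores are divided by √d_k as in standard scaled dot-product attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row max keeps exp() from overflowing; the result is unchanged.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence X of shape (n_tokens, d_model)."""
    Q = X @ W_q                      # "what am I looking for?"
    K = X @ W_k                      # "what do I have to offer?"
    V = X @ W_v                      # "what payload do I deliver if matched?"
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scores[i, j] = Q_i . K_j, scaled
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights      # weighted mix of the V vectors

# Toy input: 4 tokens, 8-dimensional vectors, random (untrained) projections.
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = attention(X, W_q, W_k, W_v)
print(out.shape)      # one new vector per token
print(weights.sum(axis=-1))  # every row of weights sums to 1
```

Row i of `weights` is the "long stare versus glance" distribution for token i; in a trained model the projections are learned rather than random.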

Why "multi-head"

Doing attention once forces all the relationships through one lens. Multi-head runs the attention computation in parallel several times, each with its own learned Q/K/V projections, then concatenates. One head might learn syntactic agreement, another long-range coreference, another local punctuation patterns.

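Modern LLMs typically use somewhere between 16 and 128 heads per layer. The model width is usually split evenly across heads, so more heads means a smaller per-head dimension. In practice, many moderately sized heads tend to work better than a few very wide ones, though the benefit fades once heads become too narrow.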
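A rough sketch of the multi-head idea, assuming the common layout where the model width is split evenly across heads. The final output projection `W_o` is not mentioned in the text above but is standard in Transformer implementations; the function name, shapes, and random weights are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model); all weight matrices: (d_model, d_model)."""
    n, d_model = X.shape
    assert d_model % n_heads == 0, "d_model must split evenly across heads"
    d_head = d_model // n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head): each head gets its own slice.
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n, n)
    weights = softmax(scores, axis=-1)                   # one pattern per head
    heads = weights @ V                                  # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)
```

Each of the 4 heads produces its own 6×6 weight matrix, which is what lets different heads specialize in different relationships.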

Causal masking

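In decoder-only LLMs, attention is masked: the token at position 5 can attend to positions 1–5 but not to position 6 or later. This is what makes them autoregressive: during training, when the whole sequence is processed at once, no token can peek at the tokens it is being trained to predict. It is implemented by setting forbidden scores to -∞ before the softmax, which turns their weights into exactly 0.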
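The masking step can be sketched as follows. The function name `causal_attention_weights` and the all-zero toy scores are assumptions chosen so the effect of the mask is easy to read; positions are 0-indexed here, unlike the 1-indexed example in the text.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to an (n, n) score matrix, then softmax each row."""
    n = scores.shape[-1]
    # True strictly above the diagonal marks "future" positions (j > i).
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite max.
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)               # exp(-inf) == 0, so future weights vanish
    return e / e.sum(axis=-1, keepdims=True)

# With equal scores, each token spreads its attention evenly over its past.
scores = np.zeros((5, 5))
weights = causal_attention_weights(scores)
print(weights[0])  # first token can only see itself: [1. 0. 0. 0. 0.]
print(weights[4])  # last token sees all five: [0.2 0.2 0.2 0.2 0.2]
```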

The cost shape

Attention compares every pair of tokens, so it scales O(n²) with sequence length. Doubling context quadruples attention compute. This is why long-context models invest heavily in:

FlashAttention — same math, smarter memory layout, big real-world speedup.
Sparse attention — only attend to a subset (sliding window, global tokens).
Linear attention — approximations that drop the quadratic term.

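Of the three, only FlashAttention computes exactly the same result, so the model architecture can stay "vanilla attention" while just the kernel underneath changes. Sparse and linear attention change what the model actually computes.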
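To make the cost shape concrete, the sketch below counts how many token pairs get a score under full attention versus a causal sliding window. The helper names and the window size of 3 are assumptions for illustration; real sparse-attention schemes vary in their exact patterns.

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where token i may attend to token j: the last `window` positions up to i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

def count_pairs(n, window=None):
    """Number of (i, j) scores computed for a sequence of length n."""
    if window is None:
        return n * n                 # full attention: every pair
    return int(sliding_window_mask(n, window).sum())

for n in (8, 16):
    print(n, count_pairs(n), count_pairs(n, window=3))
# 8 64 21
# 16 256 45
```

Doubling the sequence from 8 to 16 tokens quadruples the full-attention count (64 to 256), while the windowed count grows roughly linearly (21 to 45).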

From the field

The practical takeaway from attention isn't the Q/K/V math — it's that not all positions in your prompt are treated equally. Models attend most reliably to the start and end of the context and get measurably worse at facts stranded in the middle of a long prompt. So when I write prompts I put the instruction and the most important context up top and the user's actual question at the very bottom — not buried between ten retrieved documents. Same words, different placement, noticeably better answers. If a long prompt seems to "ignore" an instruction, move it to an edge before you blame the model.
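One way to apply that placement rule is to assemble prompts with a small helper, so the ordering is enforced in code rather than by habit. `build_prompt`, its labels, and the sample strings are hypothetical; this is a sketch of the layout described above, not any particular library's API.

```python
def build_prompt(instruction, documents, question):
    """Put the instruction at the top, retrieved documents in the middle,
    and the user's question at the very bottom."""
    parts = [instruction.strip()]
    for k, doc in enumerate(documents, start=1):
        parts.append(f"[Document {k}]\n{doc.strip()}")
    parts.append(f"Question: {question.strip()}")
    return "\n\n".join(parts)

prompt = build_prompt(
    instruction="Answer using only the documents below. Cite the document number.",
    documents=["Refunds are processed within 14 days.", "Shipping takes 3 to 5 days."],
    question="How long do refunds take?",
)
print(prompt)
```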
