
Attention: How Models Decide What Matters

Attention is a soft lookup — every token asks every other token "are you relevant?" and weights the answer. See it move with sliders.


The classroom analogy

You are a student in a packed lecture, taking notes on a single new concept. To understand it, your eyes flick to earlier slides — but only the slides that matter. The slide titled "Definition of derivative" gets a long stare; the one with the lecturer's vacation photos gets ignored.

Attention is that flick. For every token a model is processing, it looks back over all the other tokens and gives each one a weight: this one is highly relevant (long stare), that one is barely relevant (glance). The weighted mix becomes the new representation of the current token.

The Q, K, V trio

Attention is built from three projections of every token's vector:

  • Query (Q) — "what am I looking for?"
  • Key (K) — "what do I have to offer?"
  • Value (V) — "what payload do I deliver if matched?"

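The score for token i attending to token j is the dot product Q_i · K_j, usually divided by √d_k (the square root of the key dimension) so the scores do not grow with vector size. Apply a softmax across j so the weights for token i sum to 1, then mix the V vectors by those weights. That mix is the new representation of token i.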

Three projections, two matrix multiplies (one for the scores, one for the weighted mix), and a softmax. That is it.
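The recipe above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes the tokens have already been projected into Q, K and V (the toy numbers are made up), and it divides the scores by √d_k, the scaling used in the standard formulation. The names `softmax` and `attention` are chosen here for illustration.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Mix the value vectors by those weights, one output column at a time.
        mixed = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        out.append(mixed)
    return out

# Toy inputs (assumed): 3 tokens, already projected to 2-dimensional Q, K, V.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
result = attention(Q, K, V)
```

Each row of `result` is a weighted average of the rows of V, so it always lies "between" the value vectors: attention mixes existing information rather than inventing new values.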

Why "multi-head"

Doing attention once forces all the relationships through one lens. Multi-head runs the attention computation in parallel several times, each with its own learned Q/K/V projections, then concatenates. One head might learn syntactic agreement, another long-range coreference, another local punctuation patterns.

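Modern LLMs typically use somewhere between 16 and 128 heads per layer. For a fixed model width, many heads with a smaller per-head dimension generally beat a few heads with a larger one, although quality drops again if the heads become too narrow.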
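A rough sketch of the multi-head idea, again in plain Python. The assumptions are loud ones: the projection matrices are random stand-ins for learned weights, the sizes (4 tokens, width 8, 2 heads of width 4) are arbitrary, and the final output projection that real models apply after concatenation is left out. `matmul` and `multi_head_attention` are illustrative helper names.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[c] for wi, v in zip(w, V)) for c in range(len(V[0]))])
    return out

def matmul(X, W):
    """Multiply row vectors X (n x d_in) by a matrix W (d_in x d_out)."""
    return [[sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]
            for x in X]

def multi_head_attention(X, heads):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    per_head = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                for W_q, W_k, W_v in heads]
    # Concatenate the head outputs token by token.
    out = []
    for t in range(len(X)):
        row = []
        for head_out in per_head:
            row.extend(head_out[t])
        out.append(row)
    return out

random.seed(0)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

def random_matrix(rows, cols):
    # Stand-in for learned weights (assumption: real weights come from training).
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

X = random_matrix(n_tokens, d_model)
heads = [tuple(random_matrix(d_model, d_head) for _ in range(3)) for _ in range(n_heads)]
out = multi_head_attention(X, heads)
```

Each head sees the same tokens through its own projections, and the concatenated output has the same width as the input (2 heads × 4 dimensions = 8).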

Causal masking

In decoder-only LLMs, attention is masked: the token at position 5 can attend to positions 1–5 but not to position 6 or later. This is what makes them autoregressive: during training, no position can peek at the tokens it is supposed to predict. The mask is implemented by setting forbidden scores to -∞ before the softmax, which turns their weights into exactly 0.
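A small sketch of the masking step, assuming a made-up 4×4 score matrix where `scores[i][j]` is the score of token i attending to token j. Positions are 0-indexed in the code, unlike the 1-indexed example in the text, and `causal_mask` is an illustrative name.

```python
import math

NEG_INF = float("-inf")

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(scores):
    """Keep scores[i][j] for j <= i and set every future position to -inf."""
    n = len(scores)
    return [[scores[i][j] if j <= i else NEG_INF for j in range(n)]
            for i in range(n)]

# Toy raw scores (assumed values).
scores = [
    [0.5, 1.0, -0.3, 0.2],
    [0.1, 0.4, 0.9, -1.0],
    [1.2, -0.5, 0.3, 0.8],
    [0.0, 0.7, 0.2, 0.6],
]
weights = [softmax(row) for row in causal_mask(scores)]
```

After the softmax, everything above the diagonal of `weights` is 0, and the first token can only attend to itself, so its single weight is 1.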

The cost shape

Attention compares every pair of tokens, so it scales O(n²) with sequence length. Doubling context quadruples attention compute. This is why long-context models invest heavily in:

  • FlashAttention — same math, smarter memory layout, big real-world speedup.
  • Sparse attention — only attend to a subset (sliding window, global tokens).
  • Linear attention — approximations that drop the quadratic term.

With FlashAttention, the model architecture stays "vanilla attention" and only the kernel underneath changes. Sparse and linear attention, by contrast, change what the model actually computes.
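To make the cost shape concrete, here is a back-of-the-envelope count of how many query-key scores have to be computed. It is only a sketch: it counts scores rather than measuring time, and the sliding-window variant (with a hypothetical `window` of 256 tokens) is just one simple form of sparse attention.

```python
def full_attention_pairs(n):
    """Scores computed when every token attends to every token."""
    return n * n

def sliding_window_pairs(n, window):
    """Scores computed when each token attends to itself and at most
    the previous window - 1 tokens."""
    return sum(min(i + 1, window) for i in range(n))

for n in (1024, 2048, 4096):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, window=256))
```

Doubling n multiplies the full count by four, while the sliding-window count grows roughly in proportion to n once the sequence is much longer than the window.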

Engr Mejba Ahmed
