The assembly-line analogy
Picture a factory line where every station is identical: each one takes a row of half-built items, refines them a little, and passes them on. Forty stations later, the items are finished products. No single station does much. The magic is the repetition and the shared technique.
A Transformer is exactly that. Each block looks the same — attention, then a feed-forward layer, with a couple of residual connections holding it all together. Stack 32 of them and you get a small LLM. Stack 96 and you get a frontier model.
What lives inside one block
A single Transformer block has just four moving parts:
- Layer norm — rescales the inputs so the math stays stable.
- Multi-head attention — every token looks at every other token and decides who matters.
- Residual connection — adds the input back to the output (skip connection) so deep stacks still train.
- Feed-forward network — a per-token MLP that "thinks" about each token independently.
That is the whole pattern: x = x + Attn(LN(x)); x = x + FFN(LN(x)).
Why stacking works
Each block can only do a small refinement. Early blocks pick up syntax — "this is a verb, that is a noun." Middle blocks find local meaning — "this is a function call, that is a string literal." Late blocks reason about the whole input — "this is a refactor request, the answer should be code."
The hierarchy is emergent, not designed. Researchers probe trained models to see which layer learned which abstraction. The exact map differs per model and task.
Decoder-only vs encoder-decoder
- Decoder-only (GPT, Llama, Claude) — every token can only attend to previous tokens. Used for text generation.
- Encoder-decoder (T5, original Transformer) — encoder sees the whole input, decoder generates conditioned on it. Used for translation-shaped tasks.
- Encoder-only (BERT) — every token sees all others. Used for classification and embeddings.
Modern generative AI is mostly decoder-only. That is the architecture you will work with day to day.
Practical numbers
| Model class | Layers | Hidden size | Params |
|---|---|---|---|
| Tiny on-device | 12–24 | 768 | 0.1–1B |
| Mid production | 32–48 | 4096 | 7–13B |
| Frontier flagship | 80–120 | 12288+ | 70B–1T+ |
Doubling layers does not double quality — there are diminishing returns. Doubling parameters with matched data, however, reliably moves the needle. This is the empirical "scaling law" everyone talks about.