The voting-committee analogy
A single neuron is a tiny voter. It listens to a few inputs, weighs them ("trust this one a lot, that one a little"), adds the votes up, and either fires "yes" or stays quiet depending on whether the total clears a bar.
Stack a thousand voters side by side and you get a committee: a layer. Stack a hundred committees so that each passes its decisions to the next, and you get a deep network that can recognise faces, translate Mandarin, or predict the next token of code. No individual voter is smart; the intelligence lives in the structure.
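To make the analogy concrete, here is a minimal sketch of voters, committees, and a stack of committees. The names (`voter`, `committee`, `deep_network`) and the hand-picked weights are illustrative assumptions, not anything a real network would learn. The example wires three voters into XOR, a function that no single threshold voter can compute on its own.

```python
def voter(inputs, weights, bar):
    """Weigh the inputs, add up the votes, fire 1 only if the total clears the bar."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > bar else 0


def committee(inputs, voters):
    """A layer: every voter sees the same inputs and casts its own vote."""
    return [voter(inputs, weights, bar) for weights, bar in voters]


def deep_network(inputs, committees):
    """Each committee passes its decisions on to the next one."""
    decisions = inputs
    for voters in committees:
        decisions = committee(decisions, voters)
    return decisions


# Hypothetical hand-picked weights (a trained network would learn these).
xor_network = [
    [([1, 1], 0.5), ([1, 1], 1.5)],  # committee 1: an "OR" voter and an "AND" voter
    [([1, -1], 0.5)],                # committee 2: fires for "OR but not AND"
]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, deep_network([a, b], xor_network))
```

Each voter here is only a weighted sum and a threshold; the XOR behaviour comes entirely from how they are wired together.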
What one neuron does
output = activation(w · x + b)
- x — vector of inputs.
- w — vector of learned weights (one per input).
- b — a learned bias.
- activation — a non-linear bend (ReLU, GELU, sigmoid, tanh).
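The formula translates almost line for line into code. This is a sketch in plain Python; the function name `neuron` and the example numbers are made up for illustration.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b, activation):
    """output = activation(w . x + b)"""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted sum plus bias
    return activation(z)


# Illustrative values only: three inputs, three weights, one bias.
x = [0.5, -1.0, 2.0]
w = [0.8, 0.1, -0.4]
b = 0.2

relu_out = neuron(x, w, b, relu)        # weighted sum is negative, so ReLU stays quiet
sigmoid_out = neuron(x, w, b, sigmoid)  # sigmoid gives a soft vote between 0 and 1
print(relu_out, sigmoid_out)
```

Swapping the `activation` argument changes the shape of the bend without touching the weighted sum.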
Without the activation, any stack of neurons collapses into a single linear layer, because a composition of linear maps is itself linear; the extra depth would add nothing. The non-linear bend is what makes deep networks expressive.
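The collapse is easy to check numerically. The sketch below uses small hand-written matrices (all values are arbitrary, chosen to be exact in floating point) and shows that two activation-free layers give the same output as one merged layer, while inserting a ReLU between them breaks the equivalence.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # layer 1: 2 inputs -> 3 hidden
b1 = [0.5, -2.0, 1.0]
W2 = [[1.0, -2.0, 0.5]]                     # layer 2: 3 hidden -> 1 output
b2 = [0.25]
x = [2.0, -1.0]

# Two layers with no activation in between.
hidden = add(matvec(W1, x), b1)
two_layers = add(matvec(W2, hidden), b2)

# The same map folded into a single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

# With a ReLU between the layers, the result is no longer the merged linear map.
with_relu = add(matvec(W2, [max(0.0, h) for h in hidden]), b2)

print(two_layers, one_layer, with_relu)
```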
Why depth beats width
The universal approximation theorem says that a network with a single, sufficiently wide hidden layer can approximate any continuous function on a bounded domain to arbitrary accuracy. In practice, that layer may need exponentially many neurons. Deep networks compose simpler functions, which can be exponentially more efficient for problems with hierarchical structure (like images, language, or code).
Common pattern: early layers learn primitives (edges, syllables, tokens), middle layers learn parts (corners, words, phrases), late layers learn whole concepts (faces, sentences, intents).
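A classic toy case makes the depth-efficiency claim tangible. The sketch below is an illustration, not a proof: it builds a "tent" function from two ReLU units and composes it with itself. Each extra layer doubles the number of straight-line pieces in the output, so 2k units spread over k layers produce 2^k pieces. A single hidden layer of n ReLU units on a one-dimensional input can produce at most n + 1 pieces, so it would need about 2^k units for the same shape. The helper names and the grid size are assumptions made for this example.

```python
def relu(z):
    return max(0.0, z)


def tent(x):
    """One layer with two ReLU units: rises from 0 to 1 on [0, 0.5], falls back on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_sawtooth(x, depth):
    """Compose the tent layer `depth` times (2 * depth ReLU units in total)."""
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_pieces(f, n_intervals=1024):
    """Count straight-line pieces of f on [0, 1] by counting slope changes on a fine grid."""
    xs = [i / n_intervals for i in range(n_intervals + 1)]
    ys = [f(x) for x in xs]
    slopes = [ys[i + 1] - ys[i] for i in range(n_intervals)]
    changes = sum(1 for s, t in zip(slopes, slopes[1:]) if s != t)
    return changes + 1


pieces = {k: count_linear_pieces(lambda x, k=k: deep_sawtooth(x, k)) for k in (1, 3, 5)}
print(pieces)
```

The grid uses a power-of-two step so every value is exact in floating point and the slope comparison is reliable for the depths shown.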
Activation functions you'll meet
- ReLU (max(0, x)) — fast, simple, default for most networks.
- GELU — smooth ReLU variant; standard inside Transformer FFNs.
- Sigmoid / tanh — historically common, mostly retired from deep nets due to vanishing gradients; still used at outputs (e.g. binary classification).
- Softmax — used at the very end to turn raw scores into a probability distribution over classes or vocab tokens.
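For reference, here is how the functions in this list can be written with nothing but the standard library. This is a sketch for scalars and short lists; real frameworks ship vectorised versions. GELU is written in its exact form, x times the standard normal CDF, and the softmax subtracts the maximum score first, a common trick to avoid overflow.

```python
import math


def relu(x):
    return max(0.0, x)


def gelu(x):
    # Exact form: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # shift by the max so exp() cannot overflow
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(relu(-2.0), gelu(1.0), sigmoid(0.0), tanh(0.0), probs)
```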
Why "deep" learning is hard
Stacking layers naively breaks training: gradients vanish, activations blow up, weights become unbalanced. The infrastructure that makes deep networks trainable today is roughly:
- Skip connections (ResNet) — let gradients bypass layers.
- Layer / batch normalisation — keep activations in a stable range.
- Better optimisers (Adam, AdamW) — handle uneven gradients.
- Initialisation schemes (Xavier, He) — start weights at the right scale.
Without these, naively stacked networks tend to stop training well beyond roughly 5–10 layers. With them, networks of 100+ layers train reliably.
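A toy experiment hints at why two of these ingredients matter. It is a rough sketch that only tracks the forward signal through random, untrained ReLU layers; it says nothing about optimisers or normalisation, and the width, depth, and seed are arbitrary choices. It compares weights that start too small against He initialisation (standard deviation sqrt(2 / fan_in)), and then adds a skip connection to the badly initialised stack.

```python
import math
import random


def relu_layer(h, W):
    return [max(0.0, sum(w * x for w, x in zip(row, h))) for row in W]


def random_weights(rng, width, std):
    return [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]


def rms(h):
    """Root-mean-square size of the activations."""
    return math.sqrt(sum(x * x for x in h) / len(h))


def signal_after(depth, width, std, residual=False, seed=0):
    """Push one random input through `depth` random ReLU layers and report its size."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        out = relu_layer(h, random_weights(rng, width, std))
        # A skip connection adds the layer's input back onto its output.
        h = [a + b for a, b in zip(h, out)] if residual else out
    return rms(h)


width, depth = 64, 20
plain_tiny = signal_after(depth, width, std=0.01)                  # weights far too small
plain_he = signal_after(depth, width, std=math.sqrt(2.0 / width))  # He initialisation
skip_tiny = signal_after(depth, width, std=0.01, residual=True)    # same bad init, with skips

print(plain_tiny, plain_he, skip_tiny)
```

With the small weights the signal shrinks at every layer and is effectively gone after 20 of them. He initialisation keeps it at a usable scale, and the skip connection keeps it alive even when the weights are poorly scaled.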
Capacity vs data
A deeper, wider network has more capacity — it can fit harder functions, but it can also memorise noise. The brake is data: enough varied examples to force it to generalise instead of memorising. Modern LLMs train on trillions of tokens precisely because their parameter counts demand it.
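One rough way to see capacity grow is to count trainable parameters. The sketch below assumes a plain fully connected network and treats the parameter count as a crude proxy for capacity; the layer sizes are hypothetical examples.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(
        n_in * n_out + n_out  # one weight per connection, one bias per neuron
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


small = count_parameters([784, 64, 10])        # narrow, one hidden layer
wide = count_parameters([784, 256, 10])        # wider hidden layer
deep = count_parameters([784, 256, 256, 10])   # wider and deeper

print(small, wide, deep)
```

Every one of those parameters is a knob that can be turned to fit signal or to fit noise, which is why the amount and variety of training data has to grow along with the network.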