
Neurons, Layers, and Why Depth Matters

A neuron is a weighted sum followed by a kink. Stack a million of them in layers and you get a function that can approximate almost anything.

The voting-committee analogy

A single neuron is a tiny voter. It listens to a few inputs, weighs them ("trust this one a lot, that one a little"), adds the votes up, and either fires "yes" or stays quiet depending on whether the total clears a bar.

Stack a thousand voters in a layer and you get a committee. Stack a hundred committees in layers that pass their decisions to the next, and you get a deep network that can recognise faces, translate Mandarin, or predict the next token of code. None of the individual voters is smart — the structure is.

What one neuron does

output = activation(w · x + b)

  • x — vector of inputs.
  • w — vector of learned weights (one per input).
  • b — a learned bias.
  • activation — a non-linear bend (ReLU, GELU, sigmoid, tanh).
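As a minimal sketch of the formula above, the snippet below computes one neuron in plain Python. The input values, weights, and bias are made up for illustration, and `neuron` is a hypothetical helper name, not a library function.

```python
import math
from typing import Callable, Sequence


def relu(z: float) -> float:
    """ReLU activation: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)


def neuron(
    x: Sequence[float],
    w: Sequence[float],
    b: float,
    activation: Callable[[float], float] = relu,
) -> float:
    """Compute activation(w · x + b) for a single neuron."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted sum plus bias
    return activation(z)


# Illustrative values only (assumed, not from any trained model).
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 1.0

print(neuron(x, w, b))             # weighted sum is 0.25, ReLU keeps it: 0.25
print(neuron(x, w, b, math.tanh))  # same sum, squashed by tanh instead
```

Swapping the `activation` argument is the only change needed to try a different bend on the same weighted sum.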

Without the activation, any stack of neurons collapses into a single linear (affine) layer, because a composition of linear maps is itself linear. Such a network cannot represent even simple non-linear functions such as XOR. The non-linear bend is what makes deep networks expressive.
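The equivalence to one linear layer can be checked numerically. The sketch below uses small made-up matrices (`W1`, `b1`, `W2`, `b2` are illustrative values, not learned weights) and merges two activation-free layers into one; inserting a ReLU between them breaks the equivalence.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Matrix-vector product W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Matrix-matrix product A @ B."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def add(u: Vector, v: Vector) -> Vector:
    return [a + b for a, b in zip(u, v)]


# Two layers with assumed, hand-picked weights: 2 inputs -> 3 hidden -> 2 outputs.
W1 = [[1.0, -2.0], [0.5, 3.0], [2.0, 1.0]]
b1 = [0.0, 1.0, -1.0]
W2 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]
b2 = [0.5, -0.5]


def two_linear_layers(x: Vector) -> Vector:
    """Layer 2 applied to layer 1, with no activation in between."""
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)


# W2 (W1 x + b1) + b2 == (W2 W1) x + (W2 b1 + b2): one merged layer.
W_merged = matmul(W2, W1)
b_merged = add(matvec(W2, b1), b2)


def one_linear_layer(x: Vector) -> Vector:
    return add(matvec(W_merged, x), b_merged)


def two_layers_with_relu(x: Vector) -> Vector:
    """Same weights, but with a ReLU bend after the first layer."""
    hidden = [max(0.0, h) for h in add(matvec(W1, x), b1)]
    return add(matvec(W2, hidden), b2)


x = [1.0, 2.0]
print(two_linear_layers(x))     # [-5.5, 2.5]
print(one_linear_layer(x))      # [-5.5, 2.5], identical to the stacked version
print(two_layers_with_relu(x))  # [-2.5, 8.5], no longer matches the merged layer
```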

Why depth beats width

The universal approximation theorem says a single, sufficiently wide hidden layer can approximate any continuous function on a bounded domain to arbitrary accuracy. The catch: for many functions the required width grows exponentially. Deep networks compose simpler functions, which can be exponentially more efficient for problems with hierarchical structure (like images, language, or code).
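One classic toy illustration of this gap, sketched here under stated assumptions, is a "tent" function built from two ReLU units and composed with itself. Each extra layer doubles the number of straight-line pieces in the output, so `depth` layers (2 units each) give 2^depth pieces, whereas a single hidden ReLU layer on a one-dimensional input with n units can produce at most n + 1 pieces. The function names and the grid-based counting method are choices made for this example only.

```python
from typing import List


def relu(x: float) -> float:
    return max(0.0, x)


def tent(x: float) -> float:
    """One tiny layer: two ReLU units combined linearly.

    On [0, 1] this rises from 0 to 1 at x = 0.5, then falls back to 0.
    """
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_tent(x: float, depth: int) -> float:
    """Compose the same two-unit layer `depth` times."""
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_pieces(depth: int, grid: int = 1024) -> int:
    """Count straight-line pieces on [0, 1] by counting direction changes.

    Assumes 2**depth <= grid so every piece spans at least one grid cell.
    """
    ys: List[float] = [deep_tent(i / grid, depth) for i in range(grid + 1)]
    rising = [b > a for a, b in zip(ys, ys[1:])]
    turns = sum(1 for p, q in zip(rising, rising[1:]) if p != q)
    return turns + 1


for depth in (1, 2, 3, 5):
    # 2 * depth ReLU units in total, but 2 ** depth linear pieces.
    print(depth, 2 * depth, count_linear_pieces(depth))
# 1 2 2
# 2 4 4
# 3 6 8
# 5 10 32
```

Matching 32 pieces with one hidden layer would need at least 31 units, versus 10 units arranged in 5 layers here. This is a hand-built example, not evidence about what trained networks learn.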

Common pattern: early layers learn primitives (edges, syllables, tokens), middle layers learn parts (corners, words, phrases), late layers learn whole concepts (faces, sentences, intents).

Activation functions you'll meet

  • ReLU (max(0, x)) — fast, simple, default for most networks.
  • GELU — smooth ReLU variant; standard inside Transformer FFNs.
  • Sigmoid / tanh — historically common, mostly retired from deep nets due to vanishing gradients; still used at outputs (e.g. binary classification).
  • Softmax — used at the very end to turn raw scores into a probability distribution over classes or vocab tokens.
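The functions listed above are short enough to write out directly. The following is a plain-Python sketch using only the standard `math` module; it uses the exact (erf-based) form of GELU, and the stability tricks in `sigmoid` and `softmax` are common practice rather than anything prescribed by the text. Tanh is available as `math.tanh`.

```python
import math
from typing import List, Sequence


def relu(x: float) -> float:
    """max(0, x)."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1.

    Subtracting the max first keeps exp() from overflowing.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), gelu(x), sigmoid(x), math.tanh(x))

probs = softmax([2.0, 1.0, 0.1])  # made-up raw scores for three classes
print(probs, sum(probs))
```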

Why "deep" learning is hard

Stacking layers naively breaks training: gradients vanish or explode, activations drift out of range, and weight scales become unbalanced across layers. The infrastructure that makes deep networks trainable today is roughly:

  • Skip connections (ResNet) — let gradients bypass layers.
  • Layer / batch normalisation — keep activations in a stable range.
  • Better optimisers (Adam, AdamW) — handle uneven gradients.
  • Initialisation schemes (Xavier, He) — start weights at the right scale.

Without these, plain networks tend to become hard to train past roughly 5–10 layers. With them, networks of 100+ layers train reliably.
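To make the "right scale" point concrete, here is a small forward-pass experiment, offered as a sketch rather than a benchmark. It pushes a random vector through a stack of random ReLU layers and tracks the size of the activations, once with an arbitrarily small weight scale (0.01, an assumed value) and once with He scaling, i.e. a standard deviation of sqrt(2 / fan_in). The width, depth, and seed are arbitrary choices, and only the forward signal is measured, not gradients.

```python
import math
import random
from typing import List


def forward_norms(depth: int, width: int, weight_std: float, seed: int = 0) -> List[float]:
    """Return the activation norm after each layer of a random ReLU stack."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]  # random input vector
    norms = [math.sqrt(sum(v * v for v in h))]
    for _ in range(depth):
        # Fresh Gaussian weights for each layer, no bias, ReLU activation.
        W = [[rng.gauss(0.0, weight_std) for _ in range(width)] for _ in range(width)]
        h = [max(0.0, sum(w * v for w, v in zip(row, h))) for row in W]
        norms.append(math.sqrt(sum(v * v for v in h)))
    return norms


width, depth = 64, 30

too_small = forward_norms(depth, width, weight_std=0.01)
he_scaled = forward_norms(depth, width, weight_std=math.sqrt(2.0 / width))

# Ratio of final to initial activation norm.
print("std=0.01 :", too_small[-1] / too_small[0])  # shrinks towards zero
print("He init  :", he_scaled[-1] / he_scaled[0])  # stays within a few orders of 1
```

With the small scale, each layer shrinks the signal by a roughly constant factor, so it decays exponentially with depth; He scaling is chosen so that the expected squared norm is preserved from layer to layer.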

Capacity vs data

A deeper, wider network has more capacity — it can fit harder functions, but it can also memorise noise. The brake is data: enough varied examples to force it to generalise instead of memorising. Modern LLMs train on trillions of tokens precisely because their parameter counts demand it.
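As a rough proxy for capacity, it helps to count parameters. The sketch below does so for a plain fully connected network; `count_parameters` is a hypothetical helper and the layer sizes are arbitrary examples, not taken from any particular model.

```python
from typing import Sequence


def count_parameters(layer_sizes: Sequence[int]) -> int:
    """Weights plus biases for a fully connected network.

    Each layer with n_in inputs and n_out outputs has
    n_in * n_out weights and n_out biases.
    """
    return sum(
        (n_in + 1) * n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


small = [784, 128, 10]        # one hidden layer
bigger = [784, 512, 512, 10]  # deeper and wider

print(count_parameters(small))   # 101770
print(count_parameters(bigger))  # 669706
```

Parameter count is only a crude stand-in for capacity, but it shows how quickly the number of values to be pinned down by data grows as layers are added or widened.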

Engr Mejba Ahmed
