Skip to main content
Neural Networks & Deep Learning MCP handshake 3 sliders

Backpropagation: How a Network Actually Learns

Backprop is just credit assignment — blame each parameter for the error, in proportion. Tune learning rate and batch size to see training stabilise or diverge.

· 2 min de leitura
Ir para o laboratório
▸ Experimente você mesmo

Arraste um slider — o diagrama reage em tempo real.

FR /100
¶ A analogia

The bad-meal analogy

A restaurant kitchen serves a dish. The customer says "too salty." The head chef has to figure out whose fault it was: the saucier who poured the salt? The cook who reduced the stock? The dishwasher who… probably not.

Backpropagation is exactly that blame assignment, automated. The model produces an answer, the loss says "you were off by X," and backprop walks the network backwards assigning a slice of blame (a gradient) to every single parameter.

The two-pass dance

Training one example is two passes through the network:

  1. Forward pass — input flows in, predictions come out, a loss is computed (how wrong was that?).
  2. Backward pass — using the chain rule of calculus, the loss is propagated backwards through every operation, producing a gradient for every parameter.

Then the optimiser nudges each parameter a step opposite its gradient. Repeat for billions of examples.

Three things the gradient tells you

For each parameter w, the gradient ∂L/∂w says:

  • Sign — "increase me" or "decrease me" to lower the loss.
  • Magnitude — how strongly this parameter influences the current error.
  • Trust — large gradients on a single example are noisy; the average across a batch is what gets used.

The optimiser does not know what a parameter "means." It just walks downhill on the loss landscape, one small step at a time.

The knobs that make or break training

  • Learning rate — too high and the network diverges (loss goes to infinity); too low and it never finishes. Modern training uses schedules (warmup → cosine decay) instead of one number.
  • Batch size — bigger batches give cleaner gradients but cost more memory; smaller batches inject useful noise. Effective batch size is often grown via gradient accumulation.
  • Gradient clipping — cap the gradient norm to stop occasional huge gradients (from rare tokens, etc.) from exploding the weights.

What goes wrong

  • Vanishing gradients — gradients shrink to ~0 in deep networks, layers stop learning. Residual connections and normalisation fix most of this.
  • Exploding gradients — gradients balloon, weights become NaN. Clipping handles it.
  • Bad init — start with weights too big or too small and forward activations blow up before backprop runs once. Modern frameworks initialise sensibly by default.

Backprop in one sentence

Backpropagation is the chain rule of calculus, applied automatically, so the optimiser can blame each parameter in exact proportion to how much it contributed to the error.

That sentence is the whole field of deep learning training. Everything else — Adam, momentum, schedules, layer norm — is engineering on top of it.

Engr Mejba Ahmed

Engr Mejba Ahmed

Claude Code Expert · Online

👋

Hey there!

Quick Actions

WhatsApp Instant reply

Chat on WhatsApp

+880 1723 741224 · Instant reply

Popular Questions

Engr Mejba Ahmed is connected
Engr Mejba Ahmed is typing...
Engr Mejba Ahmed avatar

✉ Want me to follow up? Drop your email

Engr Mejba Ahmed avatar

📞 Connect Directly

Choose how you'd like to reach me

WhatsApp

+880 1723 741224

Email

[email protected]

✓ Details sent! I'll get back to you shortly.

Powered by OpenAI

335+

Blog Posts

25

AI Courses

63

Projects

Services & Expertise

Pricing & Process

Learning & Resources

Connect & Support