The foggy-mountain analogy
Imagine standing on a foggy mountain at night. You cannot see the valley, but you can feel the slope under your feet. The fastest way down is: take a small step in the steepest downhill direction, feel again, step again. Eventually you bottom out somewhere flat.
Gradient descent is that algorithm, run on the loss landscape — a high-dimensional surface where height = how wrong the model is. Each step nudges every parameter a tiny amount in the direction that lowers the loss the most.
The basic update rule
w_new = w_old − learning_rate × gradient
That is the entire idea. Backprop computes the gradient, which points in the direction of steepest ascent; the optimiser moves every parameter a step the opposite way. Smaller steps = stable but slow. Bigger steps = fast but risk overshooting the minimum.
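As a minimal sketch of the update rule, here it is applied to a toy one-parameter loss, (w - 3)^2. The loss, the starting point, and the two learning rates are illustrative assumptions, not values from any real model.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeat w_new = w_old - learning_rate * gradient and return every iterate."""
    w = w0
    path = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)
        path.append(w)
    return path


def grad_loss(w):
    # Gradient of the toy loss (w - 3)**2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)


# A small step size creeps steadily toward the minimum.
small = gradient_descent(grad_loss, w0=0.0, learning_rate=0.1, steps=50)

# A step size that is too large overshoots further on every step and diverges.
large = gradient_descent(grad_loss, w0=0.0, learning_rate=1.1, steps=50)

print(small[-1], large[-1])
```

On this particular loss, any learning rate above 1.0 makes each step land further from the minimum than the last, which is the overshoot failure in its simplest form.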
Stochastic vs batch
- Batch GD — average gradient over the whole dataset, then step. Smooth but expensive.
- Stochastic (SGD) — gradient on one example, step. Noisy but cheap.
- Mini-batch SGD — gradient averaged over a small batch, typically 32–1024 examples, then step. The sweet spot. Used by nearly all deep learning training in practice.
The noise from mini-batches is not a bug — it acts like a temperature, helping the optimiser escape shallow local minima.
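The three variants differ only in how many examples feed each gradient estimate. The sketch below fits a line to synthetic points; the data, the function name `minibatch_sgd`, and every hyperparameter are assumptions made for illustration.

```python
import random


def minibatch_sgd(data, batch_size, learning_rate, epochs, seed=0):
    """Fit y = w * x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not shuffled
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the gradient of (w*x + b - y)**2 over the batch.
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Synthetic, noise-free data from y = 2x + 1 (illustrative only).
points = [(i / 50.0, 2.0 * (i / 50.0) + 1.0) for i in range(100)]

# batch_size = len(points) is batch GD, 1 is pure SGD, 16 is mini-batch.
w_full, b_full = minibatch_sgd(points, batch_size=len(points), learning_rate=0.1, epochs=300)
w_mini, b_mini = minibatch_sgd(points, batch_size=16, learning_rate=0.1, epochs=300)
w_one, b_one = minibatch_sgd(points, batch_size=1, learning_rate=0.1, epochs=300)

print(w_full, b_full, w_mini, b_mini, w_one, b_one)
```

Because the toy data has no noise, all three settings reach the same line; the difference here is cost per step, with one epoch taking 1, 7, or 100 parameter updates respectively.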
Momentum, RMSProp, Adam
Plain gradient descent zigzags in narrow valleys. Modern optimisers add memory:
- Momentum — keep a running average of past gradients, like a marble that has built up speed. Smooths the path.
- RMSProp — divide each parameter's step by a running root-mean-square of its recent gradients. Tames parameters with consistently large gradients and speeds up those with small ones.
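- Adam / AdamW — Adam combines momentum with RMSProp-style scaling; AdamW additionally applies weight decay as a separate step rather than through the gradient. Default optimiser for most of deep learning today.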
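The three update rules can be sketched in a few lines each. This is a simplified illustration, not a library implementation: the function names, the `state` dictionaries, and the default hyperparameters are assumptions, and the momentum variant uses the running-average form described above (some formulations accumulate the raw gradient instead).

```python
import math


def momentum_step(w, g, state, lr=0.01, beta=0.9):
    """Momentum: step along an exponential moving average of past gradients."""
    state["v"] = [beta * v + (1 - beta) * gi for v, gi in zip(state["v"], g)]
    return [wi - lr * v for wi, v in zip(w, state["v"])]


def rmsprop_step(w, g, state, lr=0.01, beta=0.9, eps=1e-8):
    """RMSProp: divide each step by a running RMS of that parameter's gradients."""
    state["s"] = [beta * s + (1 - beta) * gi * gi for s, gi in zip(state["s"], g)]
    return [wi - lr * gi / (math.sqrt(s) + eps) for wi, gi, s in zip(w, g, state["s"])]


def adam_step(w, g, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum on the gradient, RMSProp-style scaling, plus bias correction."""
    state["t"] += 1
    state["m"] = [beta1 * m + (1 - beta1) * gi for m, gi in zip(state["m"], g)]
    state["s"] = [beta2 * s + (1 - beta2) * gi * gi for s, gi in zip(state["s"], g)]
    # Bias correction compensates for both averages starting at zero.
    m_hat = [m / (1 - beta1 ** state["t"]) for m in state["m"]]
    s_hat = [s / (1 - beta2 ** state["t"]) for s in state["s"]]
    return [wi - lr * m / (math.sqrt(s) + eps) for wi, m, s in zip(w, m_hat, s_hat)]


def valley_loss(w):
    # A narrow valley: steep in the second coordinate, shallow in the first.
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)


def valley_grad(w):
    return [w[0], 100.0 * w[1]]


w = [1.0, 1.0]
state = {"t": 0, "m": [0.0, 0.0], "s": [0.0, 0.0]}
first = adam_step(w, valley_grad(w), state)
w = first
for _ in range(499):
    w = adam_step(w, valley_grad(w), state)
final_loss = valley_loss(w)

print(first, final_loss)
```

Note that the first Adam step moves both coordinates by about the same amount even though one gradient is 100 times larger than the other; that per-parameter normalisation is what removes the zigzag in the narrow valley.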
The loss landscape is bumpy
For a deep network the loss surface has:
- Local minima — bottoms that are not the global bottom. In practice, usually fine; many local minima have similar loss.
- Saddle points — points where the gradient is zero but the surface curves upward in some directions and downward in others. More common than local minima in high dimensions; momentum and gradient noise help you push through.
- Plateaus — long flat stretches where gradients are tiny. Learning rate schedules (warmup + decay) help here.
You almost never find the global minimum. You find a good enough basin.
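To make the saddle-point case concrete, here is a small sketch on the textbook saddle f(x, y) = x^2 - y^2. The function, starting points, and step count are illustrative assumptions; this toy surface is unbounded below, so "escaping" here simply means running off downhill.

```python
def saddle_grad(x, y):
    # Gradient of f(x, y) = x**2 - y**2, which has a saddle point at (0, 0):
    # the surface curves up along x and down along y.
    return 2.0 * x, -2.0 * y


def descend(x, y, learning_rate=0.1, steps=100):
    """Plain gradient descent from (x, y); returns the final point."""
    for _ in range(steps):
        gx, gy = saddle_grad(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


# Starting exactly on the line y = 0, the y-gradient is always zero,
# so plain gradient descent slides into the saddle and stays there.
stuck = descend(1.0, 0.0)

# A tiny nudge off that line is amplified every step and carries the point away.
escaped = descend(1.0, 1e-6)

print(stuck, escaped)
```

This is one way to see why the small perturbations from mini-batch noise matter: an exactly balanced starting point never leaves the saddle, while a nudge of one part in a million does.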
What can go wrong
- Diverging loss (going up, becoming NaN) — learning rate too high, no gradient clipping.
- Stagnant loss — learning rate too low, or vanishing gradients in deep layers.
- Spiky loss — bad batch (unusual examples), or unstable layer norm. Smooth with EMA, gradient clipping, or larger effective batches.
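Two of the remedies above are short enough to sketch directly. The helper names are hypothetical, the clipping shown is the common "clip by global norm" variant, and the EMA here is applied to a logged loss curve, which is only one reading of "smooth with EMA" (it can also refer to averaging the weights).

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient list so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def ema(values, decay=0.9):
    """Exponential moving average of a sequence, e.g. a logged loss curve."""
    smoothed = []
    avg = values[0]
    for v in values:
        avg = decay * avg + (1 - decay) * v
        smoothed.append(avg)
    return smoothed


# A gradient of norm 5 is scaled down to norm 1; its direction is unchanged.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)

# A gradient already inside the limit passes through untouched.
untouched = clip_by_global_norm([0.3, 0.4], max_norm=1.0)

# One bad batch produces a spike; the EMA damps it.
spiky = [1.0, 0.9, 5.0, 0.8, 0.7]
smooth = ema(spiky)

print(clipped, untouched, smooth)
```

Clipping changes the update itself, so it can prevent a single bad batch from wrecking the weights. Smoothing a logged curve only changes what you see, which is worth keeping in mind when reading training plots.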
In practice
You will rarely write the optimiser. You will tune:
- Learning rate (the single most important hyperparameter).
- Schedule (warmup steps, decay shape).
- Batch size (within memory limits).
- Weight decay (regularisation pulling weights toward zero).
Get those four right and a competent architecture trains itself.
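As a rough sketch of what two of these knobs look like in code, the snippet below builds a linear-warmup, cosine-decay schedule and a decoupled weight-decay step. The function names, the cosine shape, and all numeric values are assumptions chosen for illustration; other decay shapes are equally common.

```python
import math


def lr_schedule(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def apply_weight_decay(w, lr, weight_decay):
    """Decoupled weight decay: shrink each weight toward zero by a factor of lr * weight_decay."""
    return [wi * (1.0 - lr * weight_decay) for wi in w]


# Hypothetical run: 1000 steps, the first 100 of them warmup.
schedule = [
    lr_schedule(s, base_lr=1e-3, warmup_steps=100, total_steps=1000)
    for s in range(1000)
]

decayed = apply_weight_decay([1.0, -2.0], lr=0.1, weight_decay=0.1)

print(schedule[0], schedule[99], schedule[999], decayed)
```

The learning rate climbs from a small value to its peak at the end of warmup, then falls smoothly to near zero by the final step. The weight-decay step pulls every weight toward zero by the same fraction regardless of its gradient.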