Gradient Descent: Rolling Downhill to a Smarter Model

Training is a marble rolling down a wrinkled hill — the loss landscape. Tune learning rate and momentum to see it slide, oscillate, or get stuck.

Apr 29, 2026 · 3 min read

The foggy-mountain analogy

Imagine standing on a foggy mountain at night. You cannot see the valley, but you can feel the slope under your feet. The fastest way down is: take a small step in the steepest downhill direction, feel again, step again. Eventually you bottom out somewhere flat.

Gradient descent is that algorithm, run on the loss landscape — a high-dimensional surface where height = how wrong the model is. Each step nudges every parameter a tiny amount in the direction that lowers the loss the most.

The basic update rule

w_new = w_old − learning_rate × gradient

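That is the entire idea. Backprop computes the gradient of the loss with respect to every parameter. The gradient points in the direction of steepest ascent, so the optimiser moves each parameter a small step the opposite way. Smaller steps are stable but slow; bigger steps are fast but risk overshooting the minimum.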
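To make the trade-off concrete, here is a minimal sketch of the update rule on a toy one-parameter loss, L(w) = (w − 3)^2. The loss, the helper names `grad` and `descend`, and the three learning rates are made up for illustration; a real model would get its gradient from backprop rather than a hand-written formula.

```python
def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)

def descend(learning_rate, steps=20, w=0.0):
    # Repeatedly apply: w_new = w_old - learning_rate * gradient
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

small = descend(0.01)  # stable but slow: still far from 3 after 20 steps
good = descend(0.3)    # lands very close to 3
big = descend(1.1)     # every step overshoots further, so w diverges

for name, w in [("small", small), ("good", good), ("big", big)]:
    print(f"{name:>5}: w = {w:.4f}, distance from minimum = {abs(w - 3.0):.4f}")
```

On this loss each step multiplies the distance to the minimum by (1 − 2 × learning_rate), which is why 0.01 crawls, 0.3 converges quickly, and 1.1 blows up.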

Stochastic vs batch

Batch GD — average gradient over the whole dataset, then step. Smooth but expensive.
Stochastic (SGD) — gradient on one example, step. Noisy but cheap.
Mini-batch SGD — gradient averaged over a small batch, typically 32–1024 examples, then step. The sweet spot between the two, and the standard choice in practice.

The noise from mini-batches is not a bug — it acts like a temperature, helping the optimiser escape shallow local minima.
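The three variants differ only in how many examples feed each gradient estimate, which a short sketch can show. Everything here is an assumption chosen for illustration: the synthetic data (y = 2x + 1 plus a little noise), the helper names, the learning rate, and the epoch count.

```python
import random

random.seed(0)
# Toy dataset: y = 2x + 1 plus a little noise (invented for this example).
xs = [random.uniform(-1.0, 1.0) for _ in range(256)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0.0, 0.05)) for x in xs]

def batch_gradient(w, b, batch):
    # Average gradient of the squared error 0.5 * (w*x + b - y)^2 over the batch.
    gw = sum((w * x + b - y) * x for x, y in batch) / len(batch)
    gb = sum((w * x + b - y) for x, y in batch) / len(batch)
    return gw, gb

def train(batch_size, epochs=50, learning_rate=0.1):
    w, b = 0.0, 0.0
    rng = random.Random(1)
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)  # new example order every epoch
        for i in range(0, len(shuffled), batch_size):
            gw, gb = batch_gradient(w, b, shuffled[i:i + batch_size])
            w -= learning_rate * gw
            b -= learning_rate * gb
    return w, b

w_full, b_full = train(batch_size=len(data))  # batch GD: 1 update per epoch
w_mini, b_mini = train(batch_size=32)         # mini-batch: 8 updates per epoch
w_sgd, b_sgd = train(batch_size=1)            # SGD: 256 updates per epoch

print("batch GD  :", round(w_full, 3), round(b_full, 3))
print("mini-batch:", round(w_mini, 3), round(b_mini, 3))
print("SGD       :", round(w_sgd, 3), round(b_sgd, 3))
```

For the same number of passes over the data, the full-batch run makes far fewer updates, so in this setup it ends further from the true slope of 2 than the mini-batch run does.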

Momentum, RMSProp, Adam

Plain gradient descent zigzags in narrow valleys. Modern optimisers add memory:

Momentum — keep a running average of past gradients, like a marble that has built up speed. Smooths the path.
RMSProp — divide each parameter's step by a running root-mean-square of its recent gradients. Tames parameters with large gradients and speeds up those with small ones.
Adam / AdamW — combine momentum with RMSProp-style scaling; AdamW additionally applies weight decay separately from the gradient step. The default optimisers for most of deep learning today.
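The three update rules are short enough to write out by hand. The sketch below applies each to a toy narrow valley, f(x, y) = 0.5 × (x^2 + 50 y^2). The valley, the function names, and the hyperparameter values are illustrative assumptions rather than recommended settings, and AdamW's weight decay is left out for brevity.

```python
import math

def grad(p):
    # Gradient of a narrow valley, f(x, y) = 0.5 * (x^2 + 50 * y^2):
    # the y direction is 50 times steeper than the x direction.
    x, y = p
    return [x, 50.0 * y]

def momentum(p, steps, lr=0.02, beta=0.9):
    v = [0.0, 0.0]  # decayed sum of past gradients (the marble's velocity)
    for _ in range(steps):
        g = grad(p)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        p = [pi - lr * vi for pi, vi in zip(p, v)]
    return p

def rmsprop(p, steps, lr=0.01, decay=0.9, eps=1e-8):
    s = [0.0, 0.0]  # running average of squared gradients, per parameter
    for _ in range(steps):
        g = grad(p)
        s = [decay * si + (1.0 - decay) * gi * gi for si, gi in zip(s, g)]
        p = [pi - lr * gi / (math.sqrt(si) + eps) for pi, gi, si in zip(p, g, s)]
    return p

def adam(p, steps, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    m = [0.0, 0.0]  # first moment: momentum-style average of gradients
    s = [0.0, 0.0]  # second moment: RMSProp-style average of squared gradients
    for t in range(1, steps + 1):
        g = grad(p)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        s = [beta2 * si + (1.0 - beta2) * gi * gi for si, gi in zip(s, g)]
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]  # bias correction
        s_hat = [si / (1.0 - beta2 ** t) for si in s]
        p = [pi - lr * mh / (math.sqrt(sh) + eps)
             for pi, mh, sh in zip(p, m_hat, s_hat)]
    return p

start = [1.0, 1.0]
results = {
    "momentum": momentum(start, steps=500),
    "rmsprop": rmsprop(start, steps=500),
    "adam": adam(start, steps=500),
}
for name, (x, y) in results.items():
    print(f"{name:>8}: x = {x:.6f}, y = {y:.6f}")
```

Note how RMSProp and Adam divide by the root of the squared-gradient average, so the steep y direction and the shallow x direction end up taking steps of similar size despite the 50-fold difference in gradient scale.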

The loss landscape is bumpy

For a deep network the loss surface has:

Local minima — bottoms that are not the global bottom. In practice, usually fine; many local minima have similar loss.
Saddle points — points where the gradient is zero but the surface curves upward in some directions and downward in others. More common than local minima in high dimensions; momentum helps you push through.
Plateaus — long flat stretches where gradients are tiny. Learning rate schedules (warmup + decay) help here.

You almost never find the global minimum. You find a good enough basin.
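A two-parameter toy surface is enough to see why saddle points slow training without trapping it for good. In the sketch below, the function f(x, y) = x^2 + y^4/4 − y^2/2 and the starting points are made up for illustration; it has a saddle at (0, 0) and two equally good minima at (0, 1) and (0, −1).

```python
def grad(x, y):
    # Gradient of f(x, y) = x^2 + y^4/4 - y^2/2.
    # (0, 0) is a saddle point; the minima sit at (0, 1) and (0, -1).
    return 2.0 * x, y ** 3 - y

def descend(x, y, learning_rate=0.1, steps=500):
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - learning_rate * gx, y - learning_rate * gy
    return x, y

# Starting exactly on the line y = 0, the y-gradient is always zero,
# so plain gradient descent slides straight into the saddle and stays there.
stuck = descend(1.0, 0.0)

# A tiny nudge in y is enough: the offset grows a little each step
# until the point rolls off the saddle into the basin at y = 1.
escaped = descend(1.0, 1e-6)

print("stuck  :", stuck)
print("escaped:", escaped)
```

In this example the nudged run spends most of its early steps barely moving in y, which is the slow, plateau-like stretch that a saddle produces in a loss curve. Any small perturbation, such as the noise in a mini-batch gradient, plays the role of the nudge.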

What can go wrong

Diverging loss (going up, becoming NaN) — learning rate too high, no gradient clipping.
Stagnant loss — learning rate too low, or vanishing gradients in deep layers.
Spiky loss — bad batch (unusual examples), or unstable layer norm. Smooth with EMA, gradient clipping, or larger effective batches.
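Two of the remedies above fit in a few lines each. The sketch below shows one common form of gradient clipping (rescaling by the global norm) and a guard that flags a NaN or infinite loss. The function names and the threshold of 1.0 are hypothetical; deep learning frameworks ship their own clipping utilities, which this does not try to reproduce.

```python
import math

def clip_by_global_norm(grads, max_norm):
    # Rescale the whole gradient vector so its L2 norm is at most max_norm.
    # The direction is preserved; only the length changes.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

def loss_is_bad(loss):
    # True if the loss has become NaN or infinite, a sign of divergence.
    return math.isnan(loss) or math.isinf(loss)

# A gradient with norm 5 gets scaled down to norm 1 (about [0.6, 0.8]).
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)

# A gradient already inside the limit is left untouched.
unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(unchanged)

print(loss_is_bad(float("nan")), loss_is_bad(0.25))
```

In a training loop, the clipping call would sit between the backward pass and the optimiser step, and the guard would typically skip the update or stop the run when it fires.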