
Gradient Descent: Rolling Downhill to a Smarter Model

Training is a marble rolling down a wrinkled hill — the loss landscape. Tune learning rate and momentum to see it slide, oscillate, or get stuck.

The foggy-mountain analogy

Imagine standing on a foggy mountain at night. You cannot see the valley, but you can feel the slope under your feet. The fastest way down is: take a small step in the steepest downhill direction, feel again, step again. Eventually you bottom out somewhere flat.

Gradient descent is that algorithm, run on the loss landscape — a high-dimensional surface where height = how wrong the model is. Each step nudges every parameter a tiny amount in the direction that lowers the loss the most.

The basic update rule

w_new = w_old − learning_rate × gradient

That is the entire idea. Backprop computes the gradient, which points in the direction of steepest ascent; the optimiser moves every parameter a step the opposite way. Smaller steps = stable but slow. Bigger steps = fast but risk overshoot.
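As a minimal sketch of the update rule, the snippet below runs it on a one-parameter toy loss. The loss L(w) = (w - 3)^2, the function names, and the three learning rates are illustrative assumptions, chosen only to show the slow, well-tuned, and overshooting cases.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeatedly apply w_new = w_old - learning_rate * gradient."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)
        history.append(w)
    return history


# Toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3.
def toy_grad(w):
    return 2.0 * (w - 3.0)


small = gradient_descent(toy_grad, w0=0.0, learning_rate=0.01, steps=50)
good = gradient_descent(toy_grad, w0=0.0, learning_rate=0.1, steps=50)
too_big = gradient_descent(toy_grad, w0=0.0, learning_rate=1.1, steps=50)

print(abs(small[-1] - 3.0))    # still some distance from the minimum
print(abs(good[-1] - 3.0))     # very close to the minimum
print(abs(too_big[-1] - 3.0))  # error grows every step: divergence
```

On this loss, each step multiplies the distance to the minimum by (1 - 2 × learning_rate), so any learning rate above 1.0 makes that factor larger than 1 in magnitude and the marble bounces further out on every step.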

Stochastic vs batch

  • Batch GD — average gradient over the whole dataset, then step. Smooth but expensive.
  • Stochastic (SGD) — gradient on one example, step. Noisy but cheap.
  • Mini-batch SGD — gradient on 32–1024 examples. The sweet spot, and what nearly all deep-learning training uses in practice.

The noise from mini-batches is not a bug — it acts like a temperature, helping the optimiser escape shallow local minima.
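The three variants differ only in how many examples feed each gradient estimate, so one function with a batch_size argument can cover all of them. The sketch below fits a line to synthetic data; the data, the seeds, and the function name are made up for illustration.

```python
import random


def minibatch_sgd(data, batch_size, learning_rate, epochs, seed=0):
    """Fit y ~ w * x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    data = list(data)  # copy so shuffling does not touch the caller's list
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the per-example gradients of (w*x + b - y)^2 over the batch.
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Synthetic data from y = 2x + 1 plus a little noise (made up for illustration).
data_rng = random.Random(42)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(256)]
data = [(x, 2.0 * x + 1.0 + data_rng.gauss(0.0, 0.05)) for x in xs]

# batch_size = len(data) is batch GD, 32 is mini-batch SGD, 1 is pure SGD.
w_full, b_full = minibatch_sgd(data, batch_size=len(data), learning_rate=0.1, epochs=200)
w_mini, b_mini = minibatch_sgd(data, batch_size=32, learning_rate=0.1, epochs=200)
w_one, b_one = minibatch_sgd(data, batch_size=1, learning_rate=0.1, epochs=200)

print(w_full, b_full)
print(w_mini, b_mini)
print(w_one, b_one)
```

All three runs should land near w = 2 and b = 1. The smaller the batch, the more the final values jitter around that point, which is the noise the paragraph above describes.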

Momentum, RMSProp, Adam

Plain gradient descent zigzags in narrow valleys. Modern optimisers add memory:

  • Momentum — keep a running average of past gradients, like a marble that has built up speed. Smooths the path.
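  • RMSProp — divide each parameter's step by a running root-mean-square of its recent gradients. Tames parameters with large gradients and speeds up those with small ones.
  • Adam / AdamW — combine momentum + RMSProp; AdamW additionally applies weight decay directly to the weights rather than through the gradient. Default optimiser for most of deep learning today.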
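The three kinds of memory in the list above can be sketched as per-parameter update functions. This is a simplified illustration, not any library's implementation: the hyperparameter defaults are commonly quoted values, the momentum variant accumulates a decaying sum of gradients (one of several conventions), and the "narrow valley" loss is a made-up quadratic with 100 times more curvature in one direction.

```python
import math


def sgd_step(w, g, state, lr=0.01):
    """Plain gradient descent: no memory of earlier gradients."""
    return [wi - lr * gi for wi, gi in zip(w, g)]


def momentum_step(w, g, state, lr=0.01, beta=0.9):
    """Momentum: step along a decaying sum of past gradients (the built-up speed)."""
    state["v"] = [beta * v + gi for v, gi in zip(state["v"], g)]
    return [wi - lr * v for wi, v in zip(w, state["v"])]


def rmsprop_step(w, g, state, lr=0.01, decay=0.9, eps=1e-8):
    """RMSProp: divide each step by the root of a running mean of squared gradients."""
    state["s"] = [decay * s + (1 - decay) * gi * gi for s, gi in zip(state["s"], g)]
    return [wi - lr * gi / (math.sqrt(s) + eps) for wi, gi, s in zip(w, g, state["s"])]


def adam_step(w, g, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-style first moment over RMSProp-style second moment, bias-corrected."""
    state["t"] += 1
    t = state["t"]
    state["m"] = [beta1 * m + (1 - beta1) * gi for m, gi in zip(state["m"], g)]
    state["s"] = [beta2 * s + (1 - beta2) * gi * gi for s, gi in zip(state["s"], g)]
    m_hat = [m / (1 - beta1 ** t) for m in state["m"]]
    s_hat = [s / (1 - beta2 ** t) for s in state["s"]]
    return [wi - lr * m / (math.sqrt(s) + eps) for wi, m, s in zip(w, m_hat, s_hat)]


def valley_loss(w):
    """Narrow valley: 100x more curvature along w[1] than along w[0]."""
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)


def valley_grad(w):
    return [w[0], 100.0 * w[1]]


def run(step_fn, steps=200):
    """Run one optimiser from the same start point and return the final loss."""
    w = [1.0, 1.0]
    state = {"v": [0.0, 0.0], "m": [0.0, 0.0], "s": [0.0, 0.0], "t": 0}
    for _ in range(steps):
        w = step_fn(w, valley_grad(w), state)
    return valley_loss(w)


results = {fn.__name__: run(fn) for fn in (sgd_step, momentum_step, rmsprop_step, adam_step)}
for name, final_loss in results.items():
    print(f"{name:14s} final loss = {final_loss:.3e}")
```

With a learning rate small enough to stay stable in the steep direction, plain gradient descent crawls along the shallow one. Momentum covers that shallow direction much faster at the same learning rate, and the RMSProp and Adam variants take similarly sized steps in both directions because they normalise by gradient magnitude.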

The loss landscape is bumpy

For a deep network the loss surface has:

  • Local minima — bottoms that are not the global bottom. In practice, usually fine; many local minima have similar loss.
  • Saddle points — points where the gradient is zero but the surface curves upward in some directions and downward in others. More common than local minima in high dimensions; momentum helps you push through.
  • Plateaus — long flat stretches where gradients are tiny. Learning rate schedules (warmup + decay) help here.

You almost never find the global minimum. You find a good enough basin.
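To make the saddle-point claim concrete, here is a small sketch on f(x, y) = x² - y², which has a saddle at the origin. The function, the starting point just off the ridge, and the escape threshold |y| > 1 are all assumptions picked for illustration; setting beta to 0 gives plain gradient descent.

```python
def saddle_grad(x, y):
    """Gradient of f(x, y) = x**2 - y**2, which has a saddle point at (0, 0)."""
    return 2.0 * x, -2.0 * y


def steps_to_escape(beta, learning_rate=0.01, max_steps=5000):
    """Count steps until |y| > 1, starting a tiny distance off the ridge y = 0."""
    x, y = 1.0, 1e-6
    vx, vy = 0.0, 0.0
    for step in range(1, max_steps + 1):
        gx, gy = saddle_grad(x, y)
        # Velocity is a decaying sum of past gradients; beta = 0 means no memory.
        vx = beta * vx + gx
        vy = beta * vy + gy
        x -= learning_rate * vx
        y -= learning_rate * vy
        if abs(y) > 1.0:
            return step
    return None  # never escaped within max_steps


plain_steps = steps_to_escape(beta=0.0)
momentum_steps = steps_to_escape(beta=0.9)
print("plain GD steps to escape:", plain_steps)
print("momentum steps to escape:", momentum_steps)
```

Both runs are first pulled toward the saddle along x and then drift away along y, but the momentum run compounds the tiny downhill gradient and leaves in far fewer steps.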

What can go wrong

  • Diverging loss (going up, becoming NaN) — learning rate too high, no gradient clipping.
  • Stagnant loss — learning rate too low, or vanishing gradients in deep layers.
  • Spiky loss — bad batch (unusual examples), or unstable layer norm. Smooth with EMA, gradient clipping, or larger effective batches.
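Two of the remedies named above are short enough to sketch directly. The first helper clips a flat list of gradient values by their overall L2 norm. The second reads "EMA" as an exponential moving average over the logged loss values, which is one possible interpretation of the bullet. Both function names are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


def ema(values, alpha=0.1):
    """Exponential moving average: each output mixes alpha of the new value into the old average."""
    smoothed, current = [], None
    for v in values:
        current = v if current is None else (1 - alpha) * current + alpha * v
        smoothed.append(current)
    return smoothed


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)  # the direction is kept, only the length shrinks

losses = [1.0, 0.9, 5.0, 0.8, 0.7]  # made-up loss curve with one spike
print(ema(losses))
```

Clipping keeps the gradient's direction and only caps its length, so a single bad batch cannot throw the parameters far off course.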

In practice

You will rarely write the optimiser. You will tune:

  • Learning rate (the single most important hyperparameter).
  • Schedule (warmup steps, decay shape).
  • Batch size (within memory limits).
  • Weight decay (regularisation pulling weights toward zero).

Get those four right and a competent architecture trains itself.
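Two of those knobs, the schedule and weight decay, reduce to a few lines each. The sketch below assumes linear warmup followed by cosine decay, which is one common shape among several, and a plain SGD step with the weight-decay pull written out explicitly. The function names and numbers are illustrative.

```python
import math


def lr_schedule(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def sgd_step_with_weight_decay(w, g, lr, weight_decay):
    """One SGD step that subtracts the gradient step plus a pull toward zero proportional to each weight."""
    return [wi - lr * weight_decay * wi - lr * gi for wi, gi in zip(w, g)]


# Print the schedule at a few points of a hypothetical 100-step run.
for step in (0, 5, 9, 10, 55, 100):
    print(step, lr_schedule(step, base_lr=1e-3, warmup_steps=10, total_steps=100))

# With a zero gradient, weight decay alone shrinks the weight slightly.
print(sgd_step_with_weight_decay([1.0], [0.0], lr=0.1, weight_decay=0.1))
```

In a real training loop the schedule would be evaluated once per optimiser step and its value passed in as the learning rate for that step.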

Engr Mejba Ahmed
