The puppy-training analogy
A puppy does not read a manual. It tries something — sit, jump, chew the slipper — and either gets a treat (good) or a sharp "no" (bad). Over time, behaviours that earn treats happen more often. Behaviours that earn nothing fade out.
Reinforcement learning (RL) is exactly that, scaled to a computer. An agent acts in an environment, receives reward, and slowly updates its policy (its strategy for choosing actions) so that high-reward actions become more likely in similar situations.
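The act, observe, receive-reward cycle can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a standard API: `CorridorEnv`, `run_episode`, and both policies are hypothetical names invented for this example, and nothing is learned yet. It only shows how an agent and an environment take turns.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: reach the right end of a short corridor."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # the state is simply the agent's position

    def step(self, action):
        # action 1 moves right, any other action moves left; walls clamp the position
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0  # reward only on reaching the goal
        return self.pos, reward, done

def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total (undiscounted) reward."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent acts
        state, reward, done = env.step(action)  # the environment responds
        total += reward
        if done:
            break
    return total

def always_right(state):
    return 1

rng = random.Random(0)

def random_policy(state):
    return rng.choice([0, 1])

print(run_episode(CorridorEnv(), always_right))  # 1.0
print(run_episode(CorridorEnv(), random_policy))
```

A learning agent would replace the fixed policy with one that changes in response to the rewards it collects.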
The five things every RL setup has
- State (s) — what the agent observes right now.
- Action (a) — what the agent can do from this state.
- Reward (r) — a number the environment gives back.
- Policy (π) — the function state → action that the agent is learning.
- Discount factor (γ) — how much future reward matters relative to immediate reward.
The agent's goal is to learn a policy that maximises expected cumulative reward: not just the next reward, but the total over a whole episode, with later rewards weighted down by the discount factor γ.
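As a small worked example of "cumulative reward", the sketch below computes the discounted return of one episode, where the reward at step t is weighted by γ to the power t. The function name `discounted_return` and the sample reward list are assumptions made for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over one episode, accumulated back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 1.0]  # reward arrives only on the third step
print(discounted_return(rewards, gamma=0.5))  # 0.25
print(discounted_return(rewards, gamma=1.0))  # 1.0
```

With γ = 0.5 the reward two steps away counts for a quarter of its face value. With γ = 1.0 the return is the plain sum.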
The exploration vs exploitation knife-edge
- Exploit — pick the action you currently believe is best.
- Explore — pick a random or uncertain action to learn something new.
Always exploit and you converge to the first good-enough policy you found and miss the great one. Always explore and you never cash in. Strategies such as ε-greedy or upper confidence bound (UCB) balance the two, usually with a lot of exploration early that decays toward exploitation as confidence grows.
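One way to picture this trade-off is ε-greedy with a decaying ε on a multi-armed bandit. The sketch below is illustrative only: the helper names, the decay rate, the Gaussian reward noise, and the arm means are all assumptions chosen for the example, not recommended settings.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore (random arm), otherwise exploit (best arm)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def run_bandit(true_means, steps=2000, eps_start=1.0, eps_end=0.05,
               decay=0.995, seed=0):
    rng = random.Random(seed)
    q = [0.0] * len(true_means)  # estimated value of each arm
    n = [0] * len(true_means)    # number of pulls per arm
    eps = eps_start
    for _ in range(steps):
        a = epsilon_greedy(q, eps, rng)
        reward = rng.gauss(true_means[a], 1.0)  # noisy reward from the chosen arm
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]          # incremental sample mean
        eps = max(eps_end, eps * decay)         # explore less as estimates improve
    return q, n

q, n = run_bandit([0.2, 0.5, 1.5])
print("pulls per arm:", n)
print("estimated values:", [round(v, 2) for v in q])
```

Early on almost every pull is random. Once ε has decayed, most pulls go to whichever arm currently has the highest estimate.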
Three families of RL algorithms
| Family | What it learns | Examples |
|---|---|---|
| Value-based | The value of each (state, action) | Q-learning, DQN |
| Policy-based | The policy directly | REINFORCE, PPO |
| Model-based | A model of the environment, then plans | MuZero, World Models |
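Modern LLM-era RL (RLHF, RLAIF) is mostly policy-based, typically using PPO. DPO is a common alternative that optimises the policy directly on preference data, without a separate reward model or RL sampling loop.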
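To make the value-based row of the table concrete, here is a minimal sketch of the tabular Q-learning update: nudge Q(s, a) toward the observed reward plus the discounted best value of the next state. The function name `q_update`, the nested-dict layout, and the values of `alpha` (learning rate) and `gamma` are assumptions for illustration.

```python
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step, modifying Q in place."""
    best_next = max(Q[s_next].values())  # value of the greedy action in s_next
    td_target = r + gamma * best_next    # what Q[s][a] "should" have been
    Q[s][a] += alpha * (td_target - Q[s][a])

# Two states, two actions; state 1 already has a known good action.
Q = {
    0: {"left": 0.0, "right": 0.0},
    1: {"left": 0.0, "right": 2.0},
}
q_update(Q, s=0, a="right", r=1.0, s_next=1)
print(round(Q[0]["right"], 3))  # 1.4
```

The target here is 1.0 + 0.9 × 2.0 = 2.8, and with a learning rate of 0.5 the estimate moves halfway from 0.0 toward it. A policy-based method would instead adjust the action probabilities directly.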
Where RL is the right tool
- The agent acts repeatedly and gets feedback.
- Decisions affect future states (it is not just one-shot classification).
- A clear reward exists or can be designed.
If you cannot define the reward, you do not have an RL problem yet. You have a reward-design problem — which is often the real hard part.
The reward-hacking warning
Specify reward badly and the agent will optimise the letter of it, not the spirit. Classic examples: a boat-racing agent that learned to circle endlessly collecting respawning bonus targets instead of finishing the race, and a hypothetical cleaning robot that knocks over the trash can to earn more "items put in trash." Reward design is engineering, not an afterthought.
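The cleaning-robot story can be reduced to a toy calculation. In the sketch below, a proxy reward counts every "put in trash" event, while the intended objective counts how much litter is actually off the floor at the end. The event names, both functions, and both behaviour traces are hypothetical and invented for this illustration.

```python
def proxy_reward(events):
    """Mis-specified reward: +1 for every 'put_in_trash' event."""
    return sum(1 for e in events if e == "put_in_trash")

def true_objective(events, initial_litter=3):
    """What the designer wanted: items off the floor when the episode ends."""
    on_floor, in_bin = initial_litter, 0
    for e in events:
        if e == "put_in_trash" and on_floor > 0:
            on_floor -= 1
            in_bin += 1
        elif e == "knock_over_bin":
            on_floor += in_bin  # everything spills back onto the floor
            in_bin = 0
    return initial_litter - on_floor

honest = ["put_in_trash"] * 3
hacker = (["put_in_trash"] * 3 + ["knock_over_bin"]
          + ["put_in_trash"] * 3 + ["knock_over_bin"]
          + ["put_in_trash"] * 2)

print(proxy_reward(honest), true_objective(honest))  # 3 3
print(proxy_reward(hacker), true_objective(hacker))  # 8 2
```

The second trace earns more than twice the proxy reward while leaving the room dirtier, which is exactly the behaviour an optimiser will find if only the proxy is rewarded.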