Reinforcement Learning, From Reward Signal to Smart Policy

RL is just trial, error, and reward — repeated billions of times. Tune learning rate, exploration, and discount to feel how a policy emerges.

Apr 29, 2026 · 3 min read


The puppy-training analogy

A puppy does not read a manual. It tries something — sit, jump, chew the slipper — and either gets a treat (good) or a sharp "no" (bad). Over time, behaviours that earn treats happen more often. Behaviours that earn nothing fade out.

Reinforcement learning (RL) is exactly that, scaled to a computer. An agent acts in an environment, receives reward, and slowly updates its policy (its strategy for choosing actions) so that high-reward actions become more likely in similar situations.

The five things every RL setup has

State (s) — what the agent observes right now.
Action (a) — what the agent can do from this state.
Reward (r) — a number the environment gives back.
Policy (π) — the function state → action the agent is learning.
Discount factor (γ) — how much future reward matters relative to immediate reward.

The agent's goal is to learn a policy that maximises expected cumulative reward: not just the next reward, but the total over a whole episode, with later rewards weighted down by γ.
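To make "cumulative reward" concrete, here is a minimal sketch of how a discounted return can be computed for one episode: each reward is multiplied by γ raised to the number of steps until it arrives. The function name discounted_return and the reward list are illustrative assumptions, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma ** (steps into the future)."""
    total = 0.0
    # Walk backwards so each step needs only one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: the treat only arrives on the third step.
episode = [0.0, 0.0, 1.0]

print(discounted_return(episode, gamma=1.0))  # 1.0: future reward counts in full
print(discounted_return(episode, gamma=0.5))  # 0.25: the same reward, two steps away
```

A γ close to 1 makes the agent patient; a γ close to 0 makes it care almost only about the next reward.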

The exploration vs exploitation knife-edge

Exploit — pick the action you currently believe is best.
Explore — pick a random or uncertain action to learn something new.

Always exploit and you converge on the first good-enough policy you found and miss the great one. Always explore and you never cash in. Strategies like ε-greedy or upper-confidence-bound (UCB) balance the two, usually with a lot of exploration early, decaying toward exploitation as confidence grows.
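As a rough illustration of ε-greedy with a decaying ε, the sketch below runs it on a toy multi-armed bandit. Everything here is an assumption made for the example: the function run_bandit, the arm means, the Gaussian noise, and the decay schedule are not taken from any particular library or paper.

```python
import random


def run_bandit(true_means, steps=2000, eps_start=1.0, eps_end=0.05,
               decay=0.995, seed=0):
    """Epsilon-greedy on a toy bandit; returns (estimates, pull counts)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms   # running average reward per arm
    eps = eps_start
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)                  # noisy payoff
        counts[arm] += 1
        # Incremental mean: nudge the estimate toward the new sample.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        # Decay exploration, but never below the eps_end floor.
        eps = max(eps_end, eps * decay)
    return estimates, counts


# Hypothetical arms: the third one pays best on average.
estimates, counts = run_bandit([0.2, 0.5, 1.5])
print(counts)
```

Early pulls are spread across all arms; as ε decays, most pulls go to whichever arm currently looks best. The floor eps_end keeps a small amount of exploration going in case the early estimates were wrong.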

Three families of RL algorithms

Family	What it learns	Examples
Value-based	The value of each (state, action)	Q-learning, DQN
Policy-based	The policy directly	REINFORCE, PPO
Model-based	A model of the environment, then plans	MuZero, World Models
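For the value-based row, a minimal tabular Q-learning sketch shows the whole agent loop in a few lines: act, observe reward, update the value of the (state, action) pair. The corridor environment, the step and choose helpers, and all hyperparameters are hypothetical and chosen only to keep the example small.

```python
import random

N_STATES = 5       # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = (0, 1)   # 0 = step left, 1 = step right


def step(state, action):
    """Toy environment: reward 1.0 only on reaching the goal state."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def choose(q_row, eps, rng):
    """Epsilon-greedy action choice with random tie-breaking."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(q_row)
    return rng.choice([a for a in ACTIONS if q_row[a] == best])


def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table: q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose(q[state], eps, rng)
            nxt, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


q = train()
# Greedy policy read off the learned table, for every non-goal state.
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

With these settings the greedy policy comes out as "step right" in every non-goal state. Here alpha plays the role of the learning rate, eps the exploration rate, and gamma the discount.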

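Modern LLM-era RL (RLHF, RLAIF) is mostly policy-based, typically using PPO or its variants. DPO is a common alternative that optimises the policy directly on preference data, without a separate RL loop.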

Where RL is the right tool

The agent acts repeatedly and gets feedback.
Decisions affect future states (it is not just one-shot classification).
A clear reward exists or can be designed.

If you cannot define the reward, you do not have an RL problem yet. You have a reward-design problem — which is often the real hard part.

The reward-hacking warning

Specify reward badly and the agent will optimise the letter of it, not the spirit. Classic examples: a boat-racing agent that learned to spin in a corner farming bonus pickups instead of finishing the race, and a cleaning robot that knocked over the trash can to earn "items put in trash." Reward design is engineering, not an afterthought.
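The gap between the letter and the spirit of a reward can be shown with a deliberately tiny sketch. The behaviour names and scores below are made up for illustration; the point is only that an optimiser sees the proxy number and nothing else.

```python
# Hypothetical behaviours with made-up scores. The designer cares about
# race_completed, but the agent is only ever shown proxy_reward.
behaviours = {
    "finish_race": {"proxy_reward": 10.0, "race_completed": True},
    "circle_for_pickups": {"proxy_reward": 25.0, "race_completed": False},
}


def best_by_proxy(options):
    """Pick the behaviour with the highest proxy reward, as an optimiser would."""
    return max(options, key=lambda name: options[name]["proxy_reward"])


chosen = best_by_proxy(behaviours)
print(chosen, behaviours[chosen]["race_completed"])  # circle_for_pickups False
```

One possible repair, under the same assumptions, is to make the reward depend on the outcome you actually want (finishing the race) rather than on a signal that merely tends to accompany it.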

From the field

As a builder I almost never write an RL loop from scratch. Where this actually shows up in my work is understanding why a model behaves the way it does — its helpfulness was shaped by RLHF — and occasionally designing a preference signal for a fine-tune. The lesson that transfers everywhere is reward hacking: whatever you measure is what you'll get, not what you meant. I've watched the same failure in plain product metrics — optimise "resolved tickets" and an agent learns to close tickets without solving anything. Specify the reward like an adversary will exploit it, because in production something always does.
