A Fundamentally Different Kind of Learning
Reinforcement Learning (RL) is unlike anything in supervised learning. There are no labeled examples. No training datasets. Instead, an agent learns by interacting with an environment, taking actions, receiving rewards, and gradually discovering the optimal policy — the strategy that maximizes long-term reward.
The Three Paradigms of Machine Learning
SUPERVISED LEARNING:
"Here are 10,000 cat photos labeled 'cat'. Learn what cats look like."
Input → Model → Prediction vs. Label → Adjust
UNSUPERVISED LEARNING:
"Here are 10,000 photos. Find patterns and groups."
Input → Model → Clusters/Patterns
REINFORCEMENT LEARNING:
"Here's a game. Figure out how to win. Good luck."
Agent → Action → Environment → Reward → Learn → Repeat
The Agent-Environment Interaction
            ┌──────────────┐
    ┌───────│ ENVIRONMENT  │◄───────┐
    │       └──────────────┘        │
    │  State (sₜ₊₁)                 │
    │  Reward (rₜ₊₁)           Action (aₜ)
    │       ┌──────────────┐        │
    └──────►│    AGENT     │────────┘
            │   (Policy)   │
            └──────────────┘
At each timestep t:
1. Agent observes state sₜ
2. Agent selects action aₜ based on its policy
3. Environment transitions to new state sₜ₊₁
4. Agent receives reward rₜ₊₁
5. Agent updates its policy based on the experience
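The five steps above can be sketched as a loop in code. This is a minimal illustration, not a real library: `Corridor` is a made-up toy environment (walk right from position 0 to position 3), and its `reset`/`step` interface only loosely mirrors the style of common RL toolkits.

```python
class Corridor:
    """Toy episodic environment: reach position 3 to end the episode."""
    def reset(self):
        self.pos = 0
        return self.pos                      # initial state s₀

    def step(self, action):                  # action: +1 (right) or -1 (left)
        self.pos = max(0, self.pos + action)
        done = self.pos == 3
        reward = 1.0 if done else 0.0        # reward only on reaching the goal
        return self.pos, reward, done        # sₜ₊₁, rₜ₊₁, episode over?

def run_episode(env, policy):
    """Steps 1-4 of the loop; a learning agent would also update its policy (step 5)."""
    state, done, total = env.reset(), False, 0.0
    while not done:
        action = policy(state)               # 2. select action from policy
        state, reward, done = env.step(action)  # 3-4. environment responds
        total += reward
    return total

print(run_episode(Corridor(), lambda s: +1))  # always move right → 1.0
```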
Key RL Concepts
State (s): A representation of the current situation.
Chess: The board position
Pac-Man: Positions of Pac-Man, ghosts, pellets
Self-driving car: Camera images, speed, GPS coordinates
Stock trading: Price history, portfolio, indicators
Action (a): What the agent can do in a given state.
Chess: Move a piece
Pac-Man: Up, Down, Left, Right
Self-driving car: Steer, accelerate, brake
Stock trading: Buy, sell, hold
Reward (r): Immediate feedback signal after an action.
Chess: +1 for winning, -1 for losing, 0 otherwise
Pac-Man: +10 eat pellet, +200 eat ghost, -500 die
Self-driving car: +1 for staying in lane, -100 for crash
Stock trading: Change in portfolio value
Policy (π): The agent's strategy — a mapping from states to actions.
π(s) = a "In state s, take action a"
Deterministic policy: Always the same action for a given state
Stochastic policy: Probability distribution over actions
Good policy: In Pac-Man, move toward pellets, avoid ghosts
Bad policy: Move randomly regardless of ghost positions
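The deterministic/stochastic distinction is easy to see in code. A rough sketch, where the state names and action sets are invented for illustration:

```python
import random

# Deterministic policy: one fixed action per state
deterministic_policy = {"ghost_near": "flee", "pellet_near": "eat"}

def act_deterministic(state):
    return deterministic_policy[state]       # same state → same action, always

# Stochastic policy: a probability distribution over actions per state
stochastic_policy = {"open_maze": {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25}}

def act_stochastic(state):
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs)[0]  # sample an action

print(act_deterministic("ghost_near"))        # flee
print(act_stochastic("open_maze"))            # one of: up / down / left / right
```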
The Exploration-Exploitation Dilemma
This is the fundamental challenge of RL:
EXPLOITATION: Use what you already know works
→ "This restaurant is good. Let's go there again."
→ Safe but might miss better options.
EXPLORATION: Try something new to discover potentially better options
→ "Let's try this new restaurant — it might be amazing."
→ Risky but might find something better.
RL agents must BALANCE both:
- Too much exploitation → Stuck with suboptimal strategy
- Too much exploration → Never converges on a good strategy
Epsilon-Greedy Strategy (most common solution):
With probability ε: Take a random action (EXPLORE)
With probability 1-ε: Take the best known action (EXPLOIT)
Start with ε = 1.0 (all exploration)
Gradually decrease to ε = 0.01 (mostly exploitation)
This lets the agent explore widely first, then focus on what works.
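Epsilon-greedy with decay fits in a few lines. A minimal sketch, assuming action values are stored in a simple list `q_values` indexed by action:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore (random action); otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                # EXPLORE
    return max(range(len(q_values)), key=lambda a: q_values[a])  # EXPLOIT

# Decay schedule: start fully exploratory, settle near-greedy
epsilon = 1.0
for episode in range(500):
    # ... run one episode, choosing actions via epsilon_greedy(q_values, epsilon) ...
    epsilon = max(0.01, epsilon * 0.99)   # multiplicative decay, floored at 0.01

print(epsilon)  # clamped to the floor of 0.01 after enough episodes
```

Multiplicative decay is just one common choice; linear schedules work too.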
Episodic vs Continuing Tasks
Episodic: Has a clear beginning and end.
Chess game: Start → Moves → Checkmate (episode ends)
Pac-Man level: Start → Play → Die or clear level (episode ends)
Lunar Lander: Start → Fly → Land or crash (episode ends)
Continuing: Goes on indefinitely.
Stock trading: Always running, no natural endpoint
Robot balancing: Must balance forever
Temperature control: Continuous adjustment
The Return — Why Long-Term Thinking Matters
The agent does not just maximize immediate reward — it maximizes total future reward:
Return Gₜ = rₜ₊₁ + γ·rₜ₊₂ + γ²·rₜ₊₃ + γ³·rₜ₊₄ + ...
γ (gamma) = discount factor (0 to 1)
Controls how much the agent values future vs immediate rewards.
γ = 0.0 → Only cares about immediate reward (short-sighted)
γ = 0.5 → Values near-future, discounts distant future
γ = 0.99 → Values future almost as much as present (far-sighted)
γ = 1.0 → Values all future rewards equally (only for episodic tasks)
Example with γ = 0.9:
Reward sequence: [10, 5, 20, 3]
Return = 10 + 0.9(5) + 0.81(20) + 0.729(3) = 10 + 4.5 + 16.2 + 2.187 = 32.887 ≈ 32.89
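The discounted return is a one-line sum in code. This reproduces the worked example above:

```python
def discounted_return(rewards, gamma):
    """Gₜ = Σ γᵏ · rₜ₊₁₊ₖ — later rewards are weighted by increasing powers of γ."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(round(discounted_return([10, 5, 20, 3], 0.9), 3))  # 32.887
```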
Real-World RL Applications
| Application | Agent | Environment | Reward |
|---|---|---|---|
| AlphaGo | Go player | Go board | Win/Loss |
| ChatGPT RLHF | Language model | Human evaluator | Human preference score |
| Robotics | Robot controller | Physical world | Task completion |
| Self-driving | Driving AI | Road simulator | Safety + efficiency |
| Game AI | Game player | Game engine | Score |
| Trading | Trading bot | Stock market | Profit/Loss |
Action Step
Reinforcement Learning is about learning through trial and error with a reward signal. In the next lesson, we formalize this with Markov Decision Processes and the Bellman equation — the mathematical foundation that makes Q-Learning possible.