Chapter 2 Reinforcement Learning Fundamentals & Q-Learning

How Reinforcement Learning Works — The Agent-Environment Loop

Lesson 6 / 50

A Fundamentally Different Kind of Learning

Reinforcement Learning (RL) is unlike anything in supervised learning. There are no labeled examples. No training datasets. Instead, an agent learns by interacting with an environment, taking actions, receiving rewards, and gradually discovering the optimal policy — the strategy that maximizes long-term reward.

The Three Paradigms of Machine Learning

SUPERVISED LEARNING:
"Here are 10,000 cat photos labeled 'cat'. Learn what cats look like."
Input → Model → Prediction vs. Label → Adjust

UNSUPERVISED LEARNING:
"Here are 10,000 photos. Find patterns and groups."
Input → Model → Clusters/Patterns

REINFORCEMENT LEARNING:
"Here's a game. Figure out how to win. Good luck."
Agent → Action → Environment → Reward → Learn → Repeat

The Agent-Environment Interaction

         ┌──────────────┐
         │              │
    ┌────│  ENVIRONMENT │◄───── Action (aₜ)
    │    │              │           ↑
    │    └──────────────┘           │
    │                               │
    │  State (sₜ₊₁)                │
    │  Reward (rₜ₊₁)               │
    │                               │
    ▼                               │
┌──────────────┐                    │
│              │────────────────────┘
│    AGENT     │
│   (Policy)   │
└──────────────┘

At each timestep t:
1. Agent observes state sₜ
2. Agent selects action aₜ based on its policy
3. Environment transitions to new state sₜ₊₁
4. Agent receives reward rₜ₊₁
5. Agent updates its policy based on the experience
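The five steps above can be sketched in a few lines of Python. Everything here is a hypothetical toy problem (a four-cell corridor) invented for illustration, not any particular RL library:

```python
import random

random.seed(0)  # make the random policy reproducible

class ToyEnvironment:
    """Hypothetical toy task: a 4-cell corridor; reaching cell 3 ends the episode with reward +1."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); positions are clamped to [0, 3]
        self.state = max(0, min(3, self.state + action))
        reward = 1.0 if self.state == 3 else 0.0
        done = self.state == 3
        return self.state, reward, done

def random_policy(state):
    # Placeholder policy: ignores the state, picks a random direction
    return random.choice([-1, +1])

env = ToyEnvironment()
state = env.reset()                         # 1. observe initial state
total_reward = 0.0
for t in range(100):
    action = random_policy(state)           # 2. select action from policy
    state, reward, done = env.step(action)  # 3./4. receive new state and reward
    total_reward += reward                  # (a learning agent would update its policy here: step 5)
    if done:
        break
```

The random policy here stands in for step 5; later lessons replace it with one that actually learns from the rewards.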

Key RL Concepts

State (s): A representation of the current situation.

Chess: The board position
Pac-Man: Positions of Pac-Man, ghosts, pellets
Self-driving car: Camera images, speed, GPS coordinates
Stock trading: Price history, portfolio, indicators

Action (a): What the agent can do in a given state.

Chess: Move a piece
Pac-Man: Up, Down, Left, Right
Self-driving car: Steer, accelerate, brake
Stock trading: Buy, sell, hold

Reward (r): Immediate feedback signal after an action.

Chess: +1 for winning, -1 for losing, 0 otherwise
Pac-Man: +10 eat pellet, +200 eat ghost, -500 die
Self-driving car: +1 for staying in lane, -100 for crash
Stock trading: Change in portfolio value

Policy (π): The agent's strategy — a mapping from states to actions.

π(s) = a    "In state s, take action a"

Deterministic policy: Always the same action for a given state
Stochastic policy: Probability distribution over actions

Good policy: In Pac-Man, move toward pellets, avoid ghosts
Bad policy: Move randomly regardless of ghost positions
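Both policy types can be written down directly as lookup tables. The state and action names here (`"ghost_near"`, `"flee"`, ...) are made up for illustration:

```python
import random

# Deterministic policy: a fixed state -> action lookup
deterministic_policy = {
    "ghost_near": "flee",
    "ghost_far":  "eat_pellet",
}

# Stochastic policy: each state maps to a probability distribution over actions
stochastic_policy = {
    "ghost_near": {"flee": 0.9, "eat_pellet": 0.1},
    "ghost_far":  {"flee": 0.1, "eat_pellet": 0.9},
}

def act_deterministic(state):
    return deterministic_policy[state]  # always the same action for a given state

def act_stochastic(state):
    dist = stochastic_policy[state]
    actions = list(dist.keys())
    return random.choices(actions, weights=list(dist.values()))[0]  # sampled action
```

Stochastic policies are useful precisely because they bake a bit of exploration into action selection, which connects to the dilemma below.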

The Exploration-Exploitation Dilemma

This is the fundamental challenge of RL:

EXPLOITATION: Use what you already know works
→ "This restaurant is good. Let's go there again."
→ Safe but might miss better options.

EXPLORATION: Try something new to discover potentially better options
→ "Let's try this new restaurant — it might be amazing."
→ Risky but might find something better.

RL agents must BALANCE both:
- Too much exploitation → Stuck with suboptimal strategy
- Too much exploration → Never converges on a good strategy

Epsilon-Greedy Strategy (most common solution):

With probability ε: Take a random action (EXPLORE)
With probability 1-ε: Take the best known action (EXPLOIT)

Start with ε = 1.0 (all exploration)
Gradually decrease to ε = 0.01 (mostly exploitation)
This lets the agent explore widely first, then focus on what works.
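A minimal sketch of epsilon-greedy with the decay schedule just described. The decay factor 0.995, the episode count, and the dummy Q-values are illustrative choices, not fixed constants:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action index (EXPLORE); otherwise the argmax (EXPLOIT)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # EXPLORE
    return max(range(len(q_values)), key=lambda a: q_values[a])   # EXPLOIT

# Illustrative decay: multiply epsilon each episode, floored at epsilon_min
epsilon, epsilon_min, decay = 1.0, 0.01, 0.995
for episode in range(1000):
    action = epsilon_greedy([0.0, 1.0, 0.5], epsilon)  # dummy Q-values for 3 actions
    epsilon = max(epsilon_min, epsilon * decay)

print(epsilon)  # → 0.01, once the schedule bottoms out
```

With these numbers, 0.995¹⁰⁰⁰ ≈ 0.0067, so epsilon reaches its floor well before the last episode.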

Episodic vs Continuing Tasks

Episodic: Has a clear beginning and end.

Chess game: Start → Moves → Checkmate (episode ends)
Pac-Man level: Start → Play → Die or clear level (episode ends)
Lunar Lander: Start → Fly → Land or crash (episode ends)

Continuing: Goes on indefinitely.

Stock trading: Always running, no natural endpoint
Robot balancing: Must balance forever
Temperature control: Continuous adjustment
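In code, the practical difference is the termination check: an episodic training loop runs until the environment signals `done`, then resets. A sketch with a made-up three-step environment:

```python
class CountdownEnv:
    """Hypothetical episodic task: every episode lasts exactly 3 steps, +1 reward per step."""

    def reset(self):
        self.steps_left = 3
        return self.steps_left

    def step(self, action):
        self.steps_left -= 1
        done = self.steps_left == 0   # episodic: a terminal state exists
        return self.steps_left, 1.0, done

def run_episode(env, policy, max_steps=1000):
    """Run one episode to termination; a continuing task would have no natural break."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        state, reward, done = env.step(policy(state))
        total += reward
        if done:
            break
    return total

print(run_episode(CountdownEnv(), lambda s: "noop"))  # → 3.0
```

For continuing tasks there is no `done` flag to wait for, which is one reason the discount factor below matters: it keeps the infinite sum of rewards finite.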

The Return — Why Long-Term Thinking Matters

The agent does not just maximize immediate reward — it maximizes total future reward:

Return Gₜ = rₜ₊₁ + γ·rₜ₊₂ + γ²·rₜ₊₃ + γ³·rₜ₊₄ + ...

γ (gamma) = discount factor (0 to 1)
Controls how much the agent values future vs immediate rewards.

γ = 0.0 → Only cares about immediate reward (short-sighted)
γ = 0.5 → Values near-future, discounts distant future
γ = 0.99 → Values future almost as much as present (far-sighted)
γ = 1.0 → Values all future rewards equally (only safe for episodic tasks; in a continuing task the sum could grow without bound)

Example with γ = 0.9:
Reward sequence: [10, 5, 20, 3]
Return = 10 + 0.9(5) + 0.81(20) + 0.729(3) = 10 + 4.5 + 16.2 + 2.187 = 32.887
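The arithmetic above can be checked in a couple of lines. This is a direct sum; real implementations usually accumulate the return backwards via G = r + γ·G instead, which avoids recomputing powers of γ:

```python
def discounted_return(rewards, gamma):
    """G = r1 + gamma*r2 + gamma^2*r3 + ... (rewards[k] arrives k steps in the future)."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(round(discounted_return([10, 5, 20, 3], 0.9), 3))  # → 32.887
```

Setting `gamma=0.0` reproduces the short-sighted case: only the first reward (10) survives.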

Real-World RL Applications

Application    | Agent            | Environment     | Reward
---------------|------------------|-----------------|------------------------
AlphaGo        | Go player        | Go board        | Win/Loss
ChatGPT (RLHF) | Language model   | Human evaluator | Human preference score
Robotics       | Robot controller | Physical world  | Task completion
Self-driving   | Driving AI       | Road simulator  | Safety + efficiency
Game AI        | Game player      | Game engine     | Score
Trading        | Trading bot      | Stock market    | Profit/Loss

Action Step

Reinforcement Learning is about learning through trial and error with a reward signal. In the next lesson, we formalize this with Markov Decision Processes and the Bellman equation — the mathematical foundation that makes Q-Learning possible.