Chapter 6 PPO & SAC — Modern Policy Optimization Algorithms

PPO — The Algorithm Behind ChatGPT's RLHF

Lesson 26 / 50

Proximal Policy Optimization (PPO) is not an academic curiosity — it is the algorithm that powers RLHF (Reinforcement Learning from Human Feedback), the training phase that gives ChatGPT, Claude, and Gemini their conversational quality. OpenAI published PPO in 2017, and it has remained the dominant RL algorithm in production systems ever since. Understanding PPO means understanding the engine underneath modern AI.

The Core Problem: Catastrophic Policy Updates

In standard policy gradient methods, a poorly chosen learning rate causes catastrophic updates — one bad gradient step can destroy a well-trained policy because the policy distribution shifts so far that all previously good actions now receive near-zero probability.

The naive solution (reduce learning rate) is too conservative — training slows to a crawl. We need a principled constraint on how much the policy can change per update.

TRPO: The Right Idea, Wrong Execution

Trust Region Policy Optimization (TRPO, 2015) solved this by adding a hard KL-divergence constraint:

maximize_θ   E_t[ r_t(θ) · Â_t ]
subject to   E_t[ KL(π_old(·|s_t) || π_new(·|s_t)) ] ≤ δ

This works mathematically but requires computing second-order derivatives (conjugate gradient + Fisher information matrix), making it computationally prohibitive for large networks.

PPO: The Clipped Surrogate Objective

PPO keeps the spirit of TRPO's trust region but replaces the constraint with a simple clipping operation:

L^CLIP(θ) = E_t[ min(r_t(θ) · Â_t,  clip(r_t(θ), 1-ε, 1+ε) · Â_t) ]

Where:

  • r_t(θ) = π_new(a_t|s_t) / π_old(a_t|s_t) — the probability ratio
  • Â_t — the estimated advantage
  • ε = 0.2 — the clipping range (standard default)
  r_t(θ):   clip ◄───┤ 1-ε ──── free region ──── 1+ε ├───► clip

  L^CLIP = min( unclipped term,  clipped term )  →  actual gradient signal

Intuition: if an update would push the probability ratio outside [1-ε, 1+ε] — [0.8, 1.2] at the default ε = 0.2 — the clipped term caps the objective at the boundary. The min operator makes L^CLIP a pessimistic lower bound on the true objective: clipping only ever removes the incentive to push the ratio further outside the trust region; it never lets the objective improve beyond what the unclipped term allows.
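In code, the clipped objective is only a few lines. A minimal NumPy sketch — `ppo_clip_loss` is a hypothetical helper name, and in a real implementation the log-probabilities would come from an autodiff framework so gradients flow through `log_probs_new`:

```python
import numpy as np

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Clipped surrogate objective L^CLIP, negated for gradient descent.

    All arguments are 1-D arrays over a batch of (s_t, a_t) pairs.
    """
    ratio = np.exp(log_probs_new - log_probs_old)         # r_t(θ)
    unclipped = ratio * advantages                        # r_t(θ) · Â_t
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))       # maximize L^CLIP
```

Note how a ratio of 1.8 with a positive advantage contributes only 1 + ε = 1.2 to the objective — the extra 0.6 of "improvement" is discarded, exactly the trust-region behavior described above.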

GAE — Generalized Advantage Estimation

PPO pairs with GAE (Schulman 2016) for advantage estimation:

Â_t = δ_t + (γλ)δ_{t+1} + (γλ)²δ_{t+2} + ...

where δ_t = r_t + γV(s_{t+1}) - V(s_t)  (TD error)
      λ ∈ [0,1]  controls bias-variance tradeoff
  • λ=0 → pure 1-step TD (low variance, high bias)
  • λ=1 → full Monte Carlo (unbiased, high variance)
  • λ=0.95 → sweet spot used in most PPO implementations

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout of length T.

    `values` must hold T+1 entries — V(s_0)..V(s_T) — where the final
    entry bootstraps the state that follows the last reward.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]  # zero out future terms across episode boundaries
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]  # TD error δ_t
        gae = delta + gamma * lam * mask * gae  # recursive GAE accumulation
        advantages[t] = gae  # O(1) write instead of O(T) insert(0, ...)
    return advantages

Why RLHF Uses PPO

In RLHF:

  1. A reward model (trained on human preferences) replaces the environment reward
  2. The language model is the policy π(token | context)
  3. PPO updates the LM to maximize the reward model score
  4. A KL penalty β·KL(π_new || π_ref) prevents the model from drifting too far from the reference policy (important for not forgetting language quality)

Total RLHF reward = RewardModel(response) - β · KL(π_new || π_SFT)
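A minimal sketch of how that combined reward is typically assembled per token — `rlhf_rewards` and its log-prob inputs are hypothetical names, assuming the KL penalty is charged at every generated token while the reward model scores only the finished response:

```python
import numpy as np

def rlhf_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token RLHF rewards: KL penalty everywhere, RM score at the end.

    `logp_policy` / `logp_ref` are log-probs of the generated tokens under
    the current policy and the frozen SFT reference model.
    """
    kl = logp_policy - logp_ref       # per-token KL estimate: log π_new - log π_ref
    rewards = -beta * kl              # penalize drift from the SFT model at every token
    rewards[-1] += rm_score           # reward model scores the full response once
    return rewards
```

Tokens where the policy agrees with the reference contribute zero penalty; only drift is taxed, which is what keeps language quality intact while preferences are optimized.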

PPO's clipping is perfect for this: it keeps the LM policy close to the supervised fine-tuned (SFT) model while optimizing for human preferences.

PPO vs TRPO

                            PPO                   TRPO
Constraint enforcement      Soft (clipping)       Hard (KL constraint)
Computation                 First-order (Adam)    Second-order (conjugate gradient)
Implementation complexity   Low                   High
Performance                 Comparable            Marginally better in theory
Production use              Dominant              Rare

Key Takeaways

  • The probability ratio r_t(θ) is the central quantity in PPO. When it moves outside [1-ε, 1+ε] in the direction the advantage favors, the clipped term wins the min and its gradient is zero — the update stops naturally.
  • GAE with λ=0.95 and γ=0.99 is the standard advantage estimation recipe — start here before tuning.
  • PPO performs multiple gradient update epochs on the same batch of collected data (typically 4–10 epochs), making it more sample-efficient than A3C.
  • In RLHF, PPO is combined with a KL penalty against the reference policy to prevent reward hacking — a crucial safety mechanism.
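The multi-epoch reuse from the takeaways can be sketched as follows — `new_logp_fn` and `step_fn` are hypothetical stand-ins for an autodiff framework's log-prob re-evaluation and optimizer step, so this shows the data flow rather than a definitive implementation:

```python
import numpy as np

def ppo_update(log_probs_old, advantages, new_logp_fn, step_fn,
               epochs=4, eps=0.2):
    """Sketch of PPO's multi-epoch reuse of one collected rollout batch.

    `new_logp_fn()` re-evaluates log π_new(a_t|s_t) under the current
    parameters; `step_fn(loss)` applies one first-order (Adam) step.
    """
    for _ in range(epochs):                  # same data, several gradient epochs
        ratio = np.exp(new_logp_fn() - log_probs_old)   # r_t(θ) drifts each epoch
        unclipped = ratio * advantages
        clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
        loss = -np.mean(np.minimum(unclipped, clipped))
        step_fn(loss)
```

The clipping is what makes this reuse safe: as later epochs push r_t(θ) toward the [1-ε, 1+ε] boundary, their gradient contribution shrinks to zero instead of over-fitting the stale batch.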