PPO — The Algorithm Behind ChatGPT's RLHF
Proximal Policy Optimization (PPO) is not an academic curiosity — it is the algorithm that powers RLHF (Reinforcement Learning from Human Feedback), the training phase that gives ChatGPT, Claude, and Gemini their conversational quality. OpenAI published PPO in 2017, and it has remained the dominant RL algorithm in production systems ever since. Understanding PPO means understanding the engine underneath modern AI.
The Core Problem: Catastrophic Policy Updates
In standard policy gradient methods, a poorly chosen learning rate causes catastrophic updates — one bad gradient step can destroy a well-trained policy because the policy distribution shifts so far that all previously good actions now receive near-zero probability.
The naive solution (reduce learning rate) is too conservative — training slows to a crawl. We need a principled constraint on how much the policy can change per update.
TRPO: The Right Idea, Wrong Execution
Trust Region Policy Optimization (TRPO, 2015) solved this by adding a hard KL-divergence constraint:
maximize E_t[ r_t(θ) · Â_t ]
subject to KL(π_old || π_new) ≤ δ
This works mathematically but requires computing second-order derivatives (conjugate gradient + Fisher information matrix), making it computationally prohibitive for large networks.
PPO: The Clipped Surrogate Objective
PPO keeps the spirit of TRPO's trust region but replaces the constraint with a simple clipping operation:
L^CLIP(θ) = E_t[ min(r_t(θ) · Â_t, clip(r_t(θ), 1-ε, 1+ε) · Â_t) ]
Where:
- r_t(θ) = π_new(a_t|s_t) / π_old(a_t|s_t) — the probability ratio
- Â_t — the estimated advantage
- ε = 0.2 — the clipping range (standard default)
              r_t(θ) · Â_t  (unclipped term)
                     │
         ┌───────────┴───────────┐
clip ────┤   [1-ε, 1+ε] region   ├──── clip
         └───────────┬───────────┘
                     │
      min of both = actual gradient signal
Intuition: If a gradient update would push the probability ratio outside [1-ε, 1+ε] (i.e., [0.8, 1.2] at the default ε = 0.2), the clipped term caps the objective. The min operator makes this a pessimistic lower bound: clipping kicks in only when it lowers the objective, so the policy never gains credit for pushing the ratio beyond the trust region, yet it still pays the full price for updates that make things worse.
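The clipped surrogate is only a few lines of code. A minimal NumPy sketch (the function name and the negation for use with a gradient minimizer are my choices, not from any specific library):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negated clipped surrogate L^CLIP, suitable for a minimizer.

    logp_new / logp_old: log-probabilities of the taken actions under
    the current and behavior policies; advantages: estimates of A_t.
    """
    ratio = np.exp(logp_new - logp_old)                     # r_t(θ)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # min of the two terms = pessimistic bound; negate to minimize
    return -np.mean(np.minimum(unclipped, clipped))
```

Note the asymmetry this produces: with a positive advantage and ratio 1.5, the clipped term (1.2 · Â_t) wins the min, capping the gain; with a negative advantage and ratio 0.5, the clipped term is the *worse* one and also wins, so bad moves are never hidden.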
GAE — Generalized Advantage Estimation
PPO pairs with GAE (Schulman 2016) for advantage estimation:
Â_t = δ_t + (γλ)δ_{t+1} + (γλ)²δ_{t+2} + ...
where δ_t = r_t + γV(s_{t+1}) - V(s_t) (TD error)
λ ∈ [0,1] controls bias-variance tradeoff
- λ = 0 → pure 1-step TD (low variance, high bias)
- λ = 1 → full Monte Carlo (unbiased, high variance)
- λ = 0.95 → sweet spot used in most PPO implementations
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # values must contain len(rewards) + 1 entries: V(s_0)..V(s_{T-1})
    # plus a bootstrap value V(s_T) for the state after the last step.
    advantages = []
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: δ_t = r_t + γV(s_{t+1}) - V(s_t), cut off at episode ends
        delta = rewards[t] + gamma * values[t + 1] * (1 - dones[t]) - values[t]
        # Recursion: Â_t = δ_t + γλ Â_{t+1}
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages.insert(0, gae)
    return advantages
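As a sanity check, the backward recursion matches the explicit discounted sum Â_t = Σ_l (γλ)^l δ_{t+l}. The snippet below restates compute_gae so it runs standalone and verifies this on a made-up 3-step trajectory (all reward and value numbers are illustrative):

```python
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # values has len(rewards) + 1 entries (bootstrap value at the end)
    advantages = []
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages.insert(0, gae)
    return advantages

# Toy trajectory, no terminations; numbers are made up.
rewards = [1.0, 0.5, 2.0]
values  = [0.9, 0.8, 1.5, 0.0]   # V(s_0)..V(s_2) plus bootstrap V(s_3)
dones   = [0, 0, 0]
adv = compute_gae(rewards, values, dones)

# Cross-check Â_0 against δ_0 + (γλ)δ_1 + (γλ)²δ_2 computed directly.
gamma, lam = 0.99, 0.95
deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(3)]
direct = sum((gamma * lam) ** l * deltas[l] for l in range(3))
assert abs(adv[0] - direct) < 1e-9
```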
Why RLHF Uses PPO
In RLHF:
- A reward model (trained on human preferences) replaces the environment reward
- The language model is the policy π(token | context)
- PPO updates the LM to maximize the reward model score
- A KL penalty β·KL(π_new || π_ref) prevents the model from drifting too far from the reference policy (important for not forgetting language quality)
Total RLHF reward = RewardModel(response) - β * KL(π_new || π_SFT)
PPO's clipping is perfect for this: it keeps the LM policy close to the supervised fine-tuned (SFT) model while optimizing for human preferences.
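The total RLHF reward above can be sketched as a per-response scalar. Everything here (the function name, the β value, the sampled-token KL estimate) is illustrative, not taken from a specific RLHF library:

```python
import numpy as np

def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """RewardModel(response) - β·KL(π_new || π_ref), with the KL
    approximated by summing log-prob differences over sampled tokens."""
    kl_estimate = np.sum(logp_policy - logp_ref)
    return rm_score - beta * kl_estimate
```

Summing log π_new − log π_ref over the sampled tokens is the standard single-sample Monte Carlo estimate of the KL term: it is zero when the policy has not moved from the reference, and it shrinks the reward as the response becomes more improbable under the SFT model.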
PPO vs TRPO
| | PPO | TRPO |
|---|---|---|
| Constraint enforcement | Soft (clipping) | Hard (KL constraint) |
| Computation | First-order (Adam) | Second-order (conjugate gradient) |
| Implementation complexity | Low | High |
| Performance | Comparable | Marginally better in theory |
| Production use | Dominant | Rare |
Key Takeaways
- The probability ratio r_t(θ) is the central quantity in PPO. Once it moves outside [1-ε, 1+ε] in the direction the advantage rewards, the min selects the clipped term, whose gradient with respect to θ is zero — the update stops naturally.
- GAE with λ=0.95 and γ=0.99 is the standard advantage estimation recipe — start here before tuning.
- PPO performs multiple gradient update epochs on the same batch of collected data (typically 4–10 epochs), making it more sample-efficient than A3C.
- In RLHF, PPO is combined with a KL penalty against the reference policy to prevent reward hacking — a crucial safety mechanism.
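The multi-epoch reuse in the third takeaway is just a nested loop over shuffled minibatches of one collected batch. A structural sketch (the sizes and function name are illustrative; the actual loss and optimizer step are up to the implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def ppo_minibatch_epochs(batch_size=256, minibatch_size=64, epochs=4):
    """Yield one index array per gradient step: PPO reuses a single
    collected batch for several shuffled epochs before gathering new data."""
    for _ in range(epochs):
        idx = rng.permutation(batch_size)  # reshuffle each epoch
        for start in range(0, batch_size, minibatch_size):
            yield idx[start:start + minibatch_size]

# 4 epochs × (256 / 64) minibatches = 16 gradient steps from one batch
n_steps = sum(1 for _ in ppo_minibatch_epochs())
```

This reuse is exactly why the clipping matters: after a few epochs on the same data, π_new has drifted from the π_old that collected it, and the ratio r_t(θ) is what keeps those later steps inside the trust region.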