DDPG — Deep Deterministic Policy Gradient for Continuous Control
DQN conquered discrete action spaces. But the real physical world — robotic arms, autonomous vehicles, drone flight, chemical process control — demands continuous actions. You do not choose "turn left" from a list; you choose a steering angle of -0.23 radians. DDPG was the breakthrough that brought deep RL to these continuous domains.
Why DQN Fails for Continuous Actions
DQN finds the best action by evaluating Q(s, a) for every action a and picking the maximum. With 18 discrete actions (Atari), this is easy. With a continuous action space — say, joint torques anywhere in [-1.0, 1.0] — there is no finite list to enumerate, and solving that inner maximization exactly at every step is computationally intractable.
DDPG's solution: use a separate neural network (the actor) to directly output the optimal action without searching.
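A minimal sketch of that idea, assuming a toy 3-dimensional state, a 1-dimensional action, and a single random linear layer (all illustrative, not from DDPG itself): the actor maps the state straight to a bounded continuous action, with no search over actions at all:

```python
import numpy as np

# Hypothetical one-layer "actor" with random weights (illustrative only)
rng = np.random.default_rng(0)
W = rng.standard_normal((1, 3))
b = np.zeros(1)

def actor(state: np.ndarray) -> np.ndarray:
    """Deterministic policy: state in, continuous action out, no argmax."""
    return np.tanh(W @ state + b)  # tanh bounds the action to [-1, 1]

state = np.array([0.5, -1.2, 0.3])
action = actor(state)  # a single continuous value in [-1, 1]
```

A real actor is a deeper network, but the shape of the interface is the same: one forward pass replaces the maximization over actions.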
The Four Neural Networks of DDPG
┌──────────────────────────────────────────────────────────┐
│                    DDPG ARCHITECTURE                     │
│                                                          │
│   Online Actor  ──── soft update ────▶   Target Actor    │
│    μ(s; θ^μ)                              μ'(s; θ^μ')    │
│        │                                       │         │
│        ↓                                       ↓         │
│   Online Critic ──── soft update ────▶   Target Critic   │
│    Q(s,a; θ^Q)                            Q'(s,a; θ^Q')  │
└──────────────────────────────────────────────────────────┘
- Online Actor — learns the policy μ(s) → outputs continuous action
- Target Actor — stable copy for computing target Q-values
- Online Critic — learns Q(s, a) — how good is taking action a in state s
- Target Critic — stable copy for loss computation
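A sketch of how these four networks might be set up in PyTorch (layer sizes and state/action dimensions are illustrative, not from the text): the targets start as exact copies of the online networks and are excluded from gradient updates, trailing the online networks only through soft updates:

```python
import copy

import torch
import torch.nn as nn

# Illustrative shapes: 3-dim state, 1-dim action
actor = nn.Sequential(nn.Linear(3, 32), nn.ReLU(),
                      nn.Linear(32, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(3 + 1, 32), nn.ReLU(),
                       nn.Linear(32, 1))

# Targets begin as exact copies; they are never trained directly
target_actor = copy.deepcopy(actor)
target_critic = copy.deepcopy(critic)
for p in target_actor.parameters():
    p.requires_grad_(False)
for p in target_critic.parameters():
    p.requires_grad_(False)
```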
The Deterministic Policy Gradient Theorem
The key insight from Silver et al. (2014): for a deterministic policy μ(s), the gradient of expected reward with respect to policy parameters is:
∇_{θ^μ} J ≈ E[ ∇_a Q(s, a; θ^Q) |_{a=μ(s)} · ∇_{θ^μ} μ(s; θ^μ) ]
In plain English: improve the policy by following the gradient of the critic's Q-values with respect to the actor's output. The critic teaches the actor which direction to move its output to get higher Q-values.
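A toy numerical check of this chain rule, with made-up values: take a critic Q(s, a) = -(a - 0.7)² whose maximum is at a* = 0.7, and a one-parameter "actor" μ(s) = θ (so ∇_θ μ = 1). Gradient ascent through the critic pulls θ toward the best action:

```python
a_star = 0.7   # where the toy critic's Q-value peaks (illustrative)
theta = -0.5   # actor parameter, initially far from a_star
lr = 0.1

for _ in range(200):
    a = theta                    # actor output: mu(s) = theta
    dq_da = -2.0 * (a - a_star)  # critic gradient w.r.t. the action
    dmu_dtheta = 1.0             # actor gradient w.r.t. its parameter
    theta += lr * dq_da * dmu_dtheta  # the DPG chain rule

print(theta)  # converges to ~0.7, the action the critic rates highest
```

The actor never sees a reward directly; it only climbs the critic's Q-surface, exactly as the theorem prescribes.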
Ornstein-Uhlenbeck Noise — Exploration in Continuous Spaces
Discrete-action RL typically explores with epsilon-greedy. Continuous-action RL instead needs temporally correlated noise to produce smooth, physically plausible exploratory movements:
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process for temporally correlated exploration noise."""

    def __init__(self, action_dim: int, mu: float = 0.0,
                 theta: float = 0.15, sigma: float = 0.2):
        self.action_dim = action_dim
        self.mu = mu          # long-run mean the noise reverts to
        self.theta = theta    # mean-reversion rate
        self.sigma = sigma    # scale of the random perturbations
        self.state = np.ones(self.action_dim) * self.mu

    def reset(self):
        self.state = np.ones(self.action_dim) * self.mu

    def sample(self) -> np.ndarray:
        dx = self.theta * (self.mu - self.state) + \
             self.sigma * np.random.randn(self.action_dim)
        self.state += dx
        return self.state.copy()  # copy so callers never alias the internal buffer
The OU process generates noise that changes smoothly over time — a robotic arm does not jerk randomly; it drifts and returns. This matches how physical systems explore.
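That smoothness is measurable. A quick sanity check, re-implementing the same update rule inline (1-D, fixed seed, both chosen for repeatability): the lag-1 autocorrelation of OU samples is high, while independent Gaussian noise has essentially none:

```python
import numpy as np

rng = np.random.default_rng(42)
theta, sigma, n = 0.15, 0.2, 20_000

# OU trajectory: same update as OUNoise.sample(), scalar case
x, ou = 0.0, np.empty(n)
for t in range(n):
    x += theta * (0.0 - x) + sigma * rng.standard_normal()
    ou[t] = x

# Uncorrelated Gaussian noise of the same scale, for comparison
white = sigma * rng.standard_normal(n)

def lag1(z: np.ndarray) -> float:
    """Correlation between each sample and the next one."""
    return float(np.corrcoef(z[:-1], z[1:])[0, 1])

print(lag1(ou), lag1(white))  # roughly 0.85 vs roughly 0.0
```

With θ = 0.15 the discretized OU process is an AR(1) process with coefficient 1 − θ = 0.85, which is exactly the lag-1 correlation the measurement recovers.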
Soft Target Updates — Polyak Averaging
Instead of copying target network weights periodically (DQN-style), DDPG uses slow blending:
tau = 0.005  # Polyak averaging factor

def soft_update(online_network, target_network, tau: float = 0.005):
    for online_param, target_param in zip(
        online_network.parameters(),
        target_network.parameters()
    ):
        target_param.data.copy_(
            tau * online_param.data + (1.0 - tau) * target_param.data
        )
With τ = 0.005, the target network moves 0.5% toward the online network every step. This extreme stability prevents the Q-value divergence that plagued early continuous RL.
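The practical effect of τ is easy to quantify: against a frozen online network, the target's remaining gap shrinks by a factor (1 − τ) per update, so the gap's half-life is ln(2) / −ln(1 − τ), about 138 steps:

```python
import math

tau = 0.005
# Steps for the target to close half its gap to a frozen online network
half_life = math.log(2) / -math.log(1.0 - tau)
print(round(half_life))  # about 138 update steps

# Sanity check: simulate the blend on a single scalar weight
online, target = 1.0, 0.0
for _ in range(round(half_life)):
    target = tau * online + (1.0 - tau) * target
print(target)  # close to 0.5, i.e. half the gap closed
```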
DDPG Algorithm Pseudocode
Initialize replay buffer R
Initialize actor μ(s; θ^μ) and critic Q(s,a; θ^Q) with random weights
Initialize target networks: θ^μ' ← θ^μ, θ^Q' ← θ^Q
Initialize OU noise process N
for episode = 1 to M:
    s_1 = environment.reset()
    N.reset()
    for t = 1 to T:
        # Select action with exploration noise
        a_t = μ(s_t; θ^μ) + N.sample()
        a_t = clip(a_t, -1, 1)
        # Execute action, observe reward and next state
        r_t, s_{t+1}, done_t = environment.step(a_t)
        # Store transition in replay buffer
        R.store(s_t, a_t, r_t, s_{t+1}, done_t)
        # Sample random minibatch from R
        batch = R.sample(batch_size=64)
        # Compute target Q-values with target networks; zero the bootstrap at terminal states
        y_i = r_i + γ * (1 - done_i) * Q'(s_{i+1}, μ'(s_{i+1}; θ^μ'); θ^Q')
        # Update critic by minimizing: L = (1/N) Σ (y_i - Q(s_i, a_i; θ^Q))²
        critic.update(batch, y_i)
        # Update actor using the sampled policy gradient
        actor.update(∇_θμ Q(s_i, μ(s_i; θ^μ); θ^Q))
        # Soft update target networks
        soft_update(actor, target_actor, tau=0.005)
        soft_update(critic, target_critic, tau=0.005)
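The two inner-loop updates can be sketched in PyTorch (network sizes, learning rates, and the fake minibatch below are all illustrative stand-ins for a real replay-buffer sample):

```python
import copy

import torch
import torch.nn as nn

state_dim, action_dim, batch, gamma = 3, 1, 64, 0.99

actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
                       nn.Linear(32, 1))
target_actor = copy.deepcopy(actor)
target_critic = copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Fake minibatch standing in for R.sample(batch_size=64)
s = torch.randn(batch, state_dim)
a = torch.rand(batch, action_dim) * 2 - 1
r = torch.randn(batch, 1)
s2 = torch.randn(batch, state_dim)
done = torch.zeros(batch, 1)

# Critic update: regress Q(s, a) onto the bootstrapped target y
with torch.no_grad():
    y = r + gamma * (1 - done) * target_critic(
        torch.cat([s2, target_actor(s2)], dim=1))
q = critic(torch.cat([s, a], dim=1))
critic_loss = nn.functional.mse_loss(q, y)
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# Actor update: ascend the critic's Q w.r.t. the actor's own output
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```

Note the sign on the actor loss: minimizing −Q is how the "follow the critic's gradient upward" step is expressed to a standard optimizer.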
DDPG vs Its Successors
| Algorithm | Key Improvement Over DDPG |
|---|---|
| TD3 | Twin critics to reduce Q-value overestimation |
| SAC | Entropy maximization for better exploration |
| PPO (continuous) | On-policy, more stable but less sample efficient |
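The TD3 row can be illustrated numerically (the numbers below are made up): given two independent, equally noisy estimates of the same Q-value, taking their minimum is deliberately pessimistic, which counteracts the overestimation that bootstrapped targets otherwise accumulate:

```python
import numpy as np

rng = np.random.default_rng(1)
true_q = 10.0  # the unknown true Q-value (illustrative)

# Two independent critics, each unbiased but noisy
q1 = true_q + rng.normal(0.0, 1.0, 100_000)
q2 = true_q + rng.normal(0.0, 1.0, 100_000)

single = q1.mean()                  # close to 10.0: one critic is unbiased
twin = np.minimum(q1, q2).mean()    # about 9.44: the min is pessimistic
print(single, twin)
```

For two i.i.d. unit-variance Gaussians, the min sits 1/√π ≈ 0.56 below the mean; TD3 trades that small, controlled pessimism for freedom from runaway overestimation.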
Key Takeaways
- DDPG combines DQN (replay buffer, target networks) with actor-critic for continuous actions
- The actor outputs a specific action; the critic evaluates it — they train each other
- OU noise provides temporally smooth exploration suited to physical control tasks
- Polyak averaging (τ = 0.005) gives extremely stable target networks
- DDPG is superseded by TD3 and SAC in practice, but remains essential conceptually