Chapter 10 Advanced Techniques — DDPG, World Models, Evolution Strategies & Beyond

DDPG — Deep Deterministic Policy Gradient for Continuous Control

DQN conquered discrete action spaces. But the real physical world — robotic arms, autonomous vehicles, drone flight, chemical process control — demands continuous actions. You do not choose "turn left" from a list; you choose a steering angle of -0.23 radians. DDPG was the breakthrough that brought deep RL to these continuous domains.

Why DQN Fails for Continuous Actions

DQN finds the best action by evaluating Q(s, a) for every action a and picking the maximum. With Atari's 18 discrete actions, this is a trivial lookup. With continuous joint torques taking any real value in [-1.0, 1.0], there are infinitely many candidate actions, and maximizing Q over them at every timestep is intractable.

DDPG's solution: use a separate neural network (the actor) to directly output the optimal action without searching.
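
A minimal sketch of that idea (a hypothetical two-layer actor in NumPy with made-up dimensions; a real DDPG actor is deeper and trained by gradient descent, not left at random initialization):

```python
import numpy as np

def make_actor(state_dim: int, action_dim: int, hidden: int = 32, seed: int = 0):
    """Build a tiny deterministic actor μ(s): state → continuous action in (-1, 1)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (hidden, state_dim))
    W2 = rng.normal(0.0, 0.1, (action_dim, hidden))

    def mu(s: np.ndarray) -> np.ndarray:
        h = np.maximum(W1 @ s, 0.0)   # ReLU hidden layer
        return np.tanh(W2 @ h)        # tanh squashes each output into (-1, 1)

    return mu

actor = make_actor(state_dim=4, action_dim=2)
action = actor(np.array([0.1, -0.3, 0.5, 0.2]))
# action is a 2-vector of bounded continuous values -- no argmax over actions needed
```

The tanh output layer is what makes the search-free trick work: the network emits a valid bounded action in a single forward pass.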

The Four Neural Networks of DDPG

┌─────────────────────────────────────────────────────────┐
│                    DDPG ARCHITECTURE                     │
│                                                          │
│  Online Actor  ─── soft update ──▶  Target Actor        │
│  μ(s; θ^μ)                          μ'(s; θ^μ')         │
│       │                                   │              │
│       ↓                                   ↓              │
│  Online Critic ─── soft update ──▶  Target Critic       │
│  Q(s,a; θ^Q)                         Q'(s,a; θ^Q')      │
└─────────────────────────────────────────────────────────┘
  • Online Actor — learns the policy μ(s) → outputs continuous action
  • Target Actor — stable copy for computing target Q-values
  • Online Critic — learns Q(s, a) — how good is taking action a in state s
  • Target Critic — stable copy for loss computation
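
The target-network bookkeeping can be sketched with plain dictionaries of weights standing in for networks (a toy illustration with made-up shapes, not real torch modules):

```python
import copy

import numpy as np

# Stand-ins for the online networks' parameters (shapes are made up)
online_actor  = {"W": np.zeros((2, 4))}
online_critic = {"W": np.zeros((1, 6))}

# Targets begin as exact copies: θ^μ' ← θ^μ and θ^Q' ← θ^Q
target_actor  = copy.deepcopy(online_actor)
target_critic = copy.deepcopy(online_critic)

# Gradient steps change only the online networks; the targets stay frozen
# until a soft update deliberately blends them toward the online weights.
online_actor["W"] += 1.0
```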

The Deterministic Policy Gradient Theorem

The key insight from Silver et al. (2014): for a deterministic policy μ(s), the gradient of expected reward with respect to policy parameters is:

∇_{θ^μ} J ≈ E[ ∇_a Q(s, a; θ^Q)|_{a=μ(s)} · ∇_{θ^μ} μ(s; θ^μ) ]

In plain English: improve the policy by following the gradient of the critic's Q-values with respect to the actor's output. The critic teaches the actor which direction to move its output to get higher Q-values.
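
A toy, single-number illustration of that coupling (a hand-written quadratic critic and a one-value "actor output", both hypothetical, not the networks above): if Q(s, a) = -(a - 0.7)², ascending ∇_a Q drags the action toward the maximizer 0.7.

```python
# Critic: Q(s, a) = -(a - 0.7)^2, maximized at a = 0.7
def dQ_da(a: float) -> float:
    return -2.0 * (a - 0.7)

action = 0.0   # the actor's current deterministic output
lr = 0.1
for _ in range(100):
    action += lr * dQ_da(action)   # ascend the critic's gradient w.r.t. the action

# action has climbed to ≈ 0.7, the critic's maximizer
```

In full DDPG the same signal is pushed one step further through the chain rule, into the actor's weights via ∇_{θ^μ} μ(s; θ^μ).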

Ornstein-Uhlenbeck Noise — Exploration in Continuous Spaces

Discrete-action RL uses epsilon-greedy for exploration. Continuous-action RL instead injects noise into the actor's output, and DDPG uses temporally correlated noise to produce smooth, physically plausible exploratory movements:

import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process for temporally correlated exploration noise."""
    def __init__(self, action_dim: int, mu: float = 0.0,
                 theta: float = 0.15, sigma: float = 0.2):
        self.action_dim = action_dim
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.state = np.ones(self.action_dim) * self.mu

    def reset(self):
        self.state = np.ones(self.action_dim) * self.mu

    def sample(self) -> np.ndarray:
        dx = self.theta * (self.mu - self.state) + \
             self.sigma * np.random.randn(self.action_dim)
        self.state += dx
        return self.state.copy()  # copy so callers never hold a mutating reference

The OU process generates noise that changes smoothly over time — a robotic arm does not jerk randomly; it drifts and returns. This matches how physical systems explore.
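
The mean-reversion term can be seen in isolation by switching the noise off (σ = 0), which makes the update deterministic; this is just the dx formula from the class above applied to a scalar:

```python
# OU update with the noise term switched off (sigma = 0): pure mean reversion
theta, mu = 0.15, 0.0
state = 1.0
trajectory = [state]
for _ in range(20):
    state += theta * (mu - state)   # dx = θ(μ - x)
    trajectory.append(state)

# Each step shrinks the distance to μ by a factor (1 - θ) = 0.85,
# so the state after n steps is 0.85**n: a smooth geometric drift toward μ,
# not an independent random jump at every timestep.
```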

Soft Target Updates — Polyak Averaging

Instead of copying target network weights periodically (DQN-style), DDPG uses slow blending:

def soft_update(online_network, target_network, tau: float = 0.005):
    """Polyak averaging: θ' ← τ·θ + (1 − τ)·θ'."""
    for online_param, target_param in zip(
        online_network.parameters(),
        target_network.parameters()
    ):
        target_param.data.copy_(
            tau * online_param.data + (1.0 - tau) * target_param.data
        )

With τ = 0.005, the target network moves 0.5% toward the online network every step. This extreme stability prevents the Q-value divergence that plagued early continuous RL.
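
A scalar sketch makes the pace concrete (one made-up weight per network; real updates apply this blend to every parameter):

```python
tau = 0.005
online, target = 1.0, 0.0   # scalar stand-ins for one weight in each network

for _ in range(1000):
    target = tau * online + (1.0 - tau) * target

# After n steps the remaining gap is (1 - tau)**n:
# 0.995**1000 ≈ 0.0067, so target ≈ 0.9933 -- slow, smooth tracking
# rather than the abrupt periodic copy used by DQN.
```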

DDPG Algorithm Pseudocode

Initialize replay buffer R
Initialize actor μ(s; θ^μ) and critic Q(s,a; θ^Q) with random weights
Initialize target networks: θ^μ' ← θ^μ, θ^Q' ← θ^Q
Initialize OU noise process N

for episode = 1 to M:
    s_0 = environment.reset()
    N.reset()

    for t = 1 to T:
        # Select action with exploration noise
        a_t = μ(s_t; θ^μ) + N.sample()
        a_t = clip(a_t, -1, 1)

        # Execute action, observe reward and next state
        r_t, s_{t+1}, done = environment.step(a_t)

        # Store transition in replay buffer
        R.store(s_t, a_t, r_t, s_{t+1}, done)

        # Sample random minibatch from R
        batch = R.sample(batch_size=64)

        # Compute target Q-values using target networks;
        # (1 - done_i) stops bootstrapping past terminal states
        y_i = r_i + γ * (1 - done_i) * Q'(s_{i+1}, μ'(s_{i+1}; θ^μ'); θ^Q')

        # Update critic by minimizing: L = (1/N) Σ(y_i - Q(s_i, a_i; θ^Q))²
        critic.update(batch, y_i)

        # Update actor using sampled policy gradient
        actor.update(∇_θμ Q(s_i, μ(s_i; θ^μ); θ^Q))

        # Soft update target networks
        soft_update(actor, target_actor, tau=0.005)
        soft_update(critic, target_critic, tau=0.005)
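
The target computation inside the loop can be sketched for a small batch (NumPy stand-ins with made-up values; q_next would really come from the target critic evaluated at the target actor's action):

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.5, -0.2])
q_next  = np.array([10.0, 8.0, 6.0])   # Q'(s', μ'(s')) from the target networks
dones   = np.array([0.0, 0.0, 1.0])    # terminal transitions get no bootstrap

# y_i = r_i + γ(1 - done_i) Q'(s_{i+1}, μ'(s_{i+1}))
targets = rewards + gamma * (1.0 - dones) * q_next
# → [10.9, 8.42, -0.2]; the terminal row keeps only its immediate reward
```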

DDPG vs Its Successors

  Algorithm          Key Improvement Over DDPG
  ─────────────────  ───────────────────────────────────────────────
  TD3                Twin critics to reduce Q-value overestimation
  SAC                Entropy maximization for better exploration
  PPO (continuous)   On-policy; more stable but less sample-efficient

Key Takeaways

  • DDPG combines DQN (replay buffer, target networks) with actor-critic for continuous actions
  • The actor outputs a specific action; the critic evaluates it — they train each other
  • OU noise provides temporally smooth exploration suited to physical control tasks
  • Polyak averaging (τ = 0.005) gives extremely stable target networks
  • DDPG is superseded by TD3 and SAC in practice, but remains essential conceptually