Chapter 5 A3C — Asynchronous Advantage Actor-Critic

Actor-Critic Architecture — Two Networks, One Goal



Reinforcement learning has historically split into two camps: value-based methods (e.g., DQN) that learn how much states and actions are worth, and policy gradient methods (e.g., REINFORCE) that learn which actions to take directly. Actor-Critic methods combine both in a single coherent framework, and the result is greater stability, faster convergence, and the foundation for modern algorithms such as PPO, SAC, and A3C.

The Two Networks

State s
  │
  ├──▶ Actor  π(a|s;θ)  ──▶  Action distribution (what to do)
  │
  └──▶ Critic V(s;w)    ──▶  Scalar value estimate (how good this is)
  • Actor — A neural network parameterized by θ that outputs a probability distribution over actions given state s. It is the policy π(a|s).
  • Critic — A neural network parameterized by w that outputs a single scalar V(s), the estimated expected cumulative reward from state s under the current policy.

In practice the actor and critic often share early layers (e.g., a convolutional backbone for image inputs) and split into separate heads. This shared-backbone design is the most common choice because both heads learn from the same feature representation, which saves computation.

Why Not Just One?

  • DQN (value-only) learns Q(s, a) for every action. Weakness: fails in continuous action spaces and provides no explicit policy.
  • REINFORCE (policy-only) learns π(a|s) directly. Weakness: extremely high variance in gradient estimates.
  • Actor-Critic learns both π and V. Payoff: lower variance, stable gradients, and it scales to continuous actions.

REINFORCE estimates gradients by recording full episode returns, which swing wildly between episodes. The critic acts as a baseline that dramatically reduces that variance without introducing bias.

The Advantage Function

The core insight of Actor-Critic is replacing the raw return G_t with the advantage:

A(s, a) = Q(s, a) - V(s)
  • Q(s, a) — value of taking action a in state s
  • V(s) — average value of state s under the current policy

If A(s,a) > 0, this action is better than average — increase its probability. If A(s,a) < 0, it's worse than average — decrease its probability.
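To make the sign rule concrete, here is a tiny check with hypothetical numbers (the Q and V values are made up purely for illustration):

```python
# Hypothetical values, for illustration only
Q_sa = 5.0   # Q(s, a): value of taking action a in state s
V_s  = 3.2   # V(s): average value of state s under the current policy

A = Q_sa - V_s   # advantage
# A = 1.8 > 0: action a did better than average, so raise its probability
```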

In practice we rarely compute Q directly. Instead we use the TD error as a one-step estimate of the advantage (it would be unbiased if V were the true value function; with a learned critic it trades a small bias for a large reduction in variance):

δ_t = r_t + γ V(s_{t+1}) - V(s_t)
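As a quick numerical sketch of this formula (the reward, discount, and value estimates below are made-up numbers):

```python
gamma = 0.99   # discount factor
r_t   = 1.0    # reward received after taking a_t in s_t (hypothetical)
V_st  = 0.5    # critic's estimate V(s_t) (hypothetical)
V_st1 = 0.7    # critic's estimate V(s_{t+1}) (hypothetical)

delta_t = r_t + gamma * V_st1 - V_st   # one-step TD error
# delta_t > 0: the transition went better than the critic expected
```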

This single scalar δ_t is the signal that trains both networks:

  • Actor loss: L_actor = -log π(a_t|s_t) * δ_t (policy gradient with advantage; δ_t is treated as a constant, so the gradient flows only through log π)
  • Critic loss: L_critic = δ_t² (squared TD error)
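One subtlety worth seeing in code: in the actor loss, δ_t must be detached so the policy gradient does not leak into the critic. A minimal PyTorch sketch with hypothetical scalar values:

```python
import torch

log_prob  = torch.tensor(-1.2, requires_grad=True)  # log π(a_t|s_t), hypothetical
value     = torch.tensor(0.5,  requires_grad=True)  # V(s_t), hypothetical
td_target = torch.tensor(1.693)                     # r_t + γ V(s_{t+1}), fixed target

delta = td_target - value                 # TD error δ_t
actor_loss  = -log_prob * delta.detach()  # detach: the actor term must not update the critic
critic_loss = delta.pow(2)                # squared TD error

(actor_loss + critic_loss).backward()
# log_prob.grad is -δ_t; value.grad comes only from the critic loss
```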

Minimal PyTorch Skeleton

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128),    nn.ReLU(),
        )
        self.actor_head  = nn.Linear(128, act_dim)   # logits
        self.critic_head = nn.Linear(128, 1)          # scalar V(s)

    def forward(self, x):
        feat   = self.shared(x)
        logits = self.actor_head(feat)
        value  = self.critic_head(feat)
        return logits, value
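A single training step with this module might look like the following sketch (discrete actions; the module is repeated for self-containment, and the observation sizes, transition, and hyperparameters are placeholders, not values from a real environment):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):  # same module as defined above
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.actor_head  = nn.Linear(128, act_dim)
        self.critic_head = nn.Linear(128, 1)

    def forward(self, x):
        feat = self.shared(x)
        return self.actor_head(feat), self.critic_head(feat)

model = ActorCritic(obs_dim=4, act_dim=2)   # sizes are placeholders
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

obs = torch.randn(1, 4)                     # stand-in for an env observation
logits, value = model(obs)
dist = Categorical(logits=logits)
action = dist.sample()

# Pretend the environment returned this transition
r, gamma, done = 1.0, 0.99, False
obs_next = torch.randn(1, 4)
with torch.no_grad():
    _, value_next = model(obs_next)         # bootstrap target, no gradient

td_target = r + gamma * value_next * (1.0 - float(done))
delta = td_target - value                   # TD error δ_t

actor_loss  = -dist.log_prob(action) * delta.detach()
critic_loss = delta.pow(2)
loss = (actor_loss + critic_loss).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```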

Key Takeaways

  • Actor-Critic is a meta-framework, not a single algorithm — A3C, A2C, PPO, and SAC all build on it.
  • The advantage function is the secret weapon: it tells the actor how much better or worse an action was relative to the baseline.
  • Shared backbone layers mean both networks improve from the same feature representation, leading to faster learning on high-dimensional inputs.
  • TD error as a one-step advantage estimate keeps computation cheap and enables online (non-episodic) learning.

In the next lesson we will see how A3C supercharges this framework with asynchronous parallel workers to eliminate the need for experience replay.