Actor-Critic Architecture — Two Networks, One Goal
Reinforcement learning has historically split into two camps: value-based methods (DQN) that learn how much states are worth, and policy gradient methods (REINFORCE) that learn which actions to take. Actor-Critic methods combine both into a single coherent framework, and the result is greater stability, faster convergence, and the foundation for most modern RL algorithms, including PPO, SAC, and A3C.
The Two Networks
```
State s
  │
  ├──▶ Actor  π(a|s;θ) ──▶ action distribution (what to do)
  │
  └──▶ Critic V(s;w)   ──▶ scalar value estimate (how good this is)
```
- **Actor** — a neural network parameterized by θ that outputs a probability distribution over actions given a state. It is the policy π(a|s).
- **Critic** — a neural network parameterized by w that outputs a single scalar V(s), the estimated expected cumulative reward from state s under the current policy.
The two networks can share early layers (e.g., a convolutional backbone for image inputs) that diverge into separate heads; this shared-backbone design is the most common architecture because it is parameter-efficient.
Why Not Just One?
| Method | What It Learns | Trade-off |
|---|---|---|
| DQN (value-only) | Q(s,a) for every action | Struggles in continuous action spaces; no explicit policy |
| REINFORCE (policy-only) | π(a|s) directly | Extremely high variance in gradient estimates |
| Actor-Critic | Both π and V | Lower variance, stable gradients, scales to continuous actions |
REINFORCE estimates gradients by recording full episode returns, which swing wildly between episodes. The critic acts as a baseline that dramatically reduces that variance; subtracting a state-dependent baseline leaves the gradient estimate unbiased, although bootstrapping with a learned V does introduce some bias in practice.
The Advantage Function
The core insight of Actor-Critic is replacing the raw return G_t with the advantage:
A(s, a) = Q(s, a) - V(s)
- `Q(s, a)` — value of taking action a in state s
- `V(s)` — average value of state s under the current policy
If A(s,a) > 0, this action is better than average — increase its probability. If A(s,a) < 0, it's worse than average — decrease its probability.
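A toy numeric sketch of this sign rule (the state value and action values below are made-up numbers, purely for illustration):

```python
# Critic's estimate of the state's average value under the current policy.
V_s = 2.0
# Hypothetical action values Q(s, a) for two actions in that state.
Q = {"left": 3.0, "right": 1.5}

# Advantage A(s, a) = Q(s, a) - V(s) for each action.
advantages = {a: q - V_s for a, q in Q.items()}
# "left"  -> +1.0: better than average, raise its probability
# "right" -> -0.5: worse than average, lower its probability
```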
In practice we rarely compute Q directly. Instead we use the TD error as a one-step estimate of the advantage — unbiased when V equals the true value function, and with a learned critic trading a little bias for much lower variance:
δ_t = r_t + γ V(s_{t+1}) - V(s_t)
This single scalar δ_t is the signal that trains both networks:
- **Actor loss:** `L_actor = -log π(a_t|s_t) · δ_t` (policy gradient weighted by the advantage; δ_t is treated as a constant so the actor's gradient does not flow into the critic)
- **Critic loss:** `L_critic = δ_t²` (mean squared TD error)
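A minimal sketch of both losses for a single transition in PyTorch. The reward, discount, value estimates, and log-probability below are placeholder numbers, not values from a real environment:

```python
import torch

# One-step transition (hypothetical values).
gamma = 0.99
r_t = 1.0
v_t = torch.tensor(0.5, requires_grad=True)      # critic's V(s_t)
v_next = torch.tensor(0.7)                       # bootstrap target V(s_{t+1}), no grad
log_prob = torch.tensor(-1.2, requires_grad=True)  # actor's log π(a_t|s_t)

# TD error: the single scalar that trains both networks.
delta = r_t + gamma * v_next - v_t

# Actor loss: detach delta so the policy gradient doesn't update the critic.
actor_loss = -log_prob * delta.detach()
# Critic loss: squared TD error.
critic_loss = delta.pow(2)

(actor_loss + critic_loss).backward()
```

The `detach()` is the key design choice: δ_t acts as a fixed weight on the policy gradient, while the critic is regressed toward the bootstrapped target through the squared-error term.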
Minimal PyTorch Skeleton
```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.actor_head = nn.Linear(128, act_dim)   # logits
        self.critic_head = nn.Linear(128, 1)        # scalar V(s)

    def forward(self, x):
        feat = self.shared(x)
        logits = self.actor_head(feat)
        value = self.critic_head(feat)
        return logits, value
```
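A short usage sketch showing one forward pass and action sampling. The class is repeated here so the snippet runs on its own, and the dimensions (obs_dim=4, act_dim=2, roughly CartPole-sized) are arbitrary examples:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# The ActorCritic module from above, repeated so this snippet is self-contained.
class ActorCritic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.actor_head = nn.Linear(128, act_dim)
        self.critic_head = nn.Linear(128, 1)

    def forward(self, x):
        feat = self.shared(x)
        return self.actor_head(feat), self.critic_head(feat)

net = ActorCritic(obs_dim=4, act_dim=2)
obs = torch.randn(1, 4)                 # a dummy observation batch of size 1
logits, value = net(obs)

dist = Categorical(logits=logits)       # actor: categorical distribution over actions
action = dist.sample()
log_prob = dist.log_prob(action)        # saved for the actor loss at update time
```

At each step the agent samples from the actor head and stores `log_prob` and `value`; once the TD error is available, both losses from the previous section can be computed from exactly these tensors.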
Key Takeaways
- Actor-Critic is a meta-framework, not a single algorithm — A3C, A2C, PPO, and SAC all build on it.
- The advantage function is the secret weapon: it tells the actor how much better or worse an action was relative to the baseline.
- Shared backbone layers mean both networks improve from the same feature representation, leading to faster learning on high-dimensional inputs.
- TD error as a one-step advantage estimate keeps computation cheap and enables online (non-episodic) learning.
In the next lesson we will see how A3C supercharges this framework with asynchronous parallel workers to eliminate the need for experience replay.