Teaching AI to See and Act
What if the agent only sees raw screen pixels? Deep Convolutional Q-Learning (DCQN) uses CNNs to process visual input, enabling agents to master games from pixel data alone.
The DCQN Architecture
```
RAW GAME FRAME (84×84×4 grayscale, stacked)
        │
┌───────┴─────────────────────────┐
│ CONV1: 32 filters, 8×8, str 4   │ → 20×20×32
│ CONV2: 64 filters, 4×4, str 2   │ → 9×9×64
│ CONV3: 64 filters, 3×3, str 1   │ → 7×7×64
│ FLATTEN → FC 512 → FC actions   │ → Q-values
└─────────────────────────────────┘
```
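The spatial sizes shown on the right of the diagram follow from the standard convolution output formula, floor((W − K + 2P) / S) + 1. A quick sketch to verify each stage (no padding, matching the layers above):

```python
def conv_out(size, kernel, stride, padding=0):
    # Output width of a conv layer: floor((W - K + 2P) / S) + 1
    return (size - kernel + 2 * padding) // stride + 1

w = 84
for k, s in [(8, 4), (4, 2), (3, 1)]:
    w = conv_out(w, k, s)
    print(w)  # prints 20, then 9, then 7
```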
Why Frame Stacking?
A single frame shows position but not motion: the agent cannot tell whether a ghost is approaching or retreating. Stacking the 4 most recent frames lets the network infer speed and direction:

```
Frame t-3   Frame t-2   Frame t-1   Frame t
●···        ·●··        ··●·        ···●     → Ghost moving RIGHT
```
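One common way to maintain this stack is a fixed-length deque of preprocessed frames. A minimal sketch (class and method names here are illustrative, not from any particular library):

```python
from collections import deque

import numpy as np


class FrameStack:
    """Keeps the k most recent 84x84 frames as one (k, 84, 84) observation."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)  # oldest frame drops out automatically

    def reset(self, first_frame):
        # At episode start, duplicate the first frame k times
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        self.frames.append(frame)
        return self.state()

    def state(self):
        # Stack along a new channel axis -> shape (k, H, W), the layout Conv2d expects
        return np.stack(self.frames, axis=0)


stack = FrameStack()
s = stack.reset(np.zeros((84, 84), dtype=np.uint8))
s = stack.push(np.ones((84, 84), dtype=np.uint8))
print(s.shape)  # (4, 84, 84)
```

The oldest frame occupies channel 0 and the newest channel 3, so the network sees a consistent temporal ordering every step.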
PyTorch Implementation
```python
import torch.nn as nn

class DCQN(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        # Convolutional feature extractor over the stacked frames
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),   # 84×84×4 → 20×20×32
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),  # → 9×9×64
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),  # → 7×7×64
        )
        # Fully connected head: features → one Q-value per action
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, x):
        # Normalize uint8 pixels from [0, 255] to [0, 1] before the conv stack
        return self.fc(self.conv(x.float() / 255.0))
```
DCQN vs DQN
| Feature | DQN | DCQN |
|---|---|---|
| Input | 8 numbers | 84×84×4 pixels |
| Network | FC layers | Conv + FC layers |
| Parameters | ~33K | ~1.7M |
| Training time | ~30 min CPU | ~4-12 hrs GPU |
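The ~1.7M figure in the table can be checked by hand: each conv layer has C_out × (C_in × K × K + 1) parameters and each fully connected layer N_out × (N_in + 1), counting biases. A quick tally (the action count of 9 is an assumption, e.g. a Ms. Pac-Man-sized action space):

```python
def conv_params(c_in, c_out, k):
    # Weights (c_in * k * k per filter) plus one bias per output channel
    return c_out * (c_in * k * k + 1)

def fc_params(n_in, n_out):
    # Weight matrix plus one bias per output unit
    return n_out * (n_in + 1)

num_actions = 9  # assumed action count for illustration
total = (conv_params(4, 32, 8)        # CONV1
         + conv_params(32, 64, 4)     # CONV2
         + conv_params(64, 64, 3)     # CONV3
         + fc_params(64 * 7 * 7, 512) # FC hidden layer
         + fc_params(512, num_actions))
print(total)  # 1,688,745 — i.e. roughly 1.7M
```

The fully connected hidden layer dominates: flattening 7×7×64 features into 512 units alone accounts for about 1.6M of the total.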