Chapter 4 Deep Convolutional Q-Learning — AI That Learns from Pixels

From Game Pixels to AI Actions — The DCQN Architecture

Lesson 16 / 50

Teaching AI to See and Act

What if the agent only sees raw screen pixels? Deep Convolutional Q-Learning (DCQN) uses CNNs to process visual input, enabling agents to master games from pixel data alone.

The DCQN Architecture

RAW GAME FRAME (84×84×4 grayscale, stacked)
         │
    ┌────┴────────────────────────────┐
    │ CONV1: 32 filters, 8×8, str 4   │ → 20×20×32
    │ CONV2: 64 filters, 4×4, str 2   │ → 9×9×64
    │ CONV3: 64 filters, 3×3, str 1   │ → 7×7×64
    │ FLATTEN → FC 512 → FC actions   │ → Q-values
    └─────────────────────────────────┘
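The spatial sizes on the right of the diagram follow the standard convolution output formula, (W − K) / S + 1 for a W×W input, K×K kernel, stride S, and no padding. A quick check of each layer:

```python
def conv_out(size, kernel, stride, padding=0):
    """Output spatial size of a square conv layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

s1 = conv_out(84, 8, 4)   # CONV1: (84 - 8) / 4 + 1 = 20
s2 = conv_out(s1, 4, 2)   # CONV2: (20 - 4) / 2 + 1 = 9
s3 = conv_out(s2, 3, 1)   # CONV3: (9 - 3) / 1 + 1 = 7
print(s1, s2, s3)  # 20 9 7
```

The final 7×7×64 volume is what gets flattened into the 64·7·7 = 3136 inputs of the first fully connected layer.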

Why Frame Stacking?

A single frame shows position but not motion. Stacking the 4 most recent frames lets the network infer velocity and direction:

Frame t-3  Frame t-2  Frame t-1  Frame t
  ●           ●           ●          ● → Ghost moving RIGHT
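Frame stacking is usually implemented as a fixed-length buffer of the last k preprocessed frames. A minimal sketch with NumPy (the `FrameStack` class and its method names are illustrative, not from a specific library):

```python
from collections import deque

import numpy as np


class FrameStack:
    """Keeps the last k grayscale frames and exposes them as one k-channel state."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)  # oldest frame drops off automatically

    def reset(self, first_frame):
        # Fill the buffer with copies of the first frame so the stack starts full.
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.state()

    def step(self, frame):
        self.frames.append(frame)
        return self.state()

    def state(self):
        # Shape (k, H, W) — matches the Conv2d(4, ...) input channels below.
        return np.stack(self.frames, axis=0)
```

Calling `reset` once per episode and `step` once per environment transition keeps the state shape constant at (4, 84, 84), which is what the network's first conv layer expects.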

PyTorch Implementation

import torch
import torch.nn as nn

class DCQN(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        # Convolutional feature extractor: 4 stacked frames in, 7×7×64 maps out.
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
        )
        # Fully connected head: flattened features to one Q-value per action.
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, x):
        # Scale uint8 pixels [0, 255] to [0, 1] before the conv stack.
        return self.fc(self.conv(x.float() / 255.0))
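To sanity-check the shapes end to end, push a random batch of stacked frames through the network. This sketch inlines the same layer stack as an `nn.Sequential` so it runs standalone; `num_actions = 6` is an example value (e.g. a small Atari action space), not fixed by the architecture:

```python
import torch
import torch.nn as nn

# Same layer stack as the DCQN class, inlined for a standalone shape check.
num_actions = 6  # example action count; set this per game
net = nn.Sequential(
    nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
    nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
    nn.Linear(512, num_actions),
)

# Batch of 32 states: 4 stacked 84×84 uint8 frames each, scaled to [0, 1].
x = torch.randint(0, 256, (32, 4, 84, 84), dtype=torch.uint8)
q_values = net(x.float() / 255.0)
print(q_values.shape)  # torch.Size([32, 6]) — one Q-value per action
```

If any layer's kernel/stride is changed, the `64 * 7 * 7` in the first linear layer must change with it, or this forward pass will raise a shape mismatch.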

DCQN vs DQN

Feature         DQN             DCQN
Input           8 numbers       84×84×4 pixels
Network         FC layers       Conv + FC layers
Parameters      ~33K            ~1.7M
Training time   ~30 min (CPU)   ~4–12 hrs (GPU)
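The ~1.7M figure is dominated by the first fully connected layer. A back-of-the-envelope count (weights plus biases per layer, assuming 6 actions as in the example above):

```python
# Conv layers: in_channels * out_channels * k * k weights, plus one bias per filter.
conv1 = 4 * 32 * 8 * 8 + 32
conv2 = 32 * 64 * 4 * 4 + 64
conv3 = 64 * 64 * 3 * 3 + 64
# FC layers: flattened 7×7×64 = 3136 inputs -> 512 -> 6 actions.
fc1 = 64 * 7 * 7 * 512 + 512
fc2 = 512 * 6 + 6
total = conv1 + conv2 + conv3 + fc1 + fc2
print(total)  # 1687206 — about 1.7M, ~95% of it in fc1
```

This is why DQN-style architectures shrink the spatial resolution aggressively with large strides before flattening: halving the final feature-map side would quarter the dominant fc1 term.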