The Non-Linear Magic
Without activation functions, a neural network is just a linear function — no matter how many layers you stack, the output remains a linear combination of inputs. Activation functions introduce non-linearity, enabling networks to learn complex, curved decision boundaries.
Why Non-Linearity Matters
WITHOUT activation functions:
Layer 1: y₁ = W₁x + b₁
Layer 2: y₂ = W₂y₁ + b₂ = W₂(W₁x + b₁) + b₂ = W₂W₁x + W₂b₁ + b₂
→ Still a linear function! Two layers = one layer. Useless depth.
WITH activation functions:
Layer 1: y₁ = f(W₁x + b₁)
Layer 2: y₂ = f(W₂y₁ + b₂)
→ Non-linear composition! By the universal approximation theorem, such networks can approximate any continuous function (on a bounded domain) arbitrarily well.
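The collapse of stacked linear layers can be checked numerically. A minimal NumPy sketch (layer sizes chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                                  # input vector
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)

# Two linear layers with no activation in between:
y2 = W2 @ (W1 @ x + b1) + b2

# One equivalent linear layer: W = W2·W1, b = W2·b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
y_single = W @ x + b

print(np.allclose(y2, y_single))  # → True: two layers collapsed into one
```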
The Activation Functions You Must Know
1. Sigmoid (σ)
σ(x) = 1 / (1 + e⁻ˣ)
Output range: (0, 1)
Shape: S-curve
 1  │            ╭─────────
    │         ╭──╯
0.5 │        ╭╯
    │     ╭──╯
 0  │─────╯
     -6   -3    0    3    6
When to use: Output layer for binary classification (probability of class 1).
Problems:
- Vanishing gradient — For very large or very small inputs, the gradient approaches zero, making learning extremely slow in deep networks
- Not zero-centered — Outputs are always positive, which can slow gradient-based optimization
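The vanishing-gradient problem is easy to see from the sigmoid's derivative, σ′(x) = σ(x)(1 − σ(x)), which peaks at 0.25 and collapses toward zero for large |x|. A small sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # derivative: σ(x)·(1 − σ(x))

# Gradient is largest at x = 0 (exactly 0.25) and vanishes for large |x|
for x in [0.0, 3.0, 6.0, 10.0]:
    print(x, sigmoid_grad(x))
```

At x = 10 the gradient is already below 1e-4; chained through many layers, such factors multiply and learning stalls.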
2. Tanh (Hyperbolic Tangent)
tanh(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)
Output range: (-1, 1)
Shape: S-curve centered at zero
 1  │            ╭─────────
    │         ╭──╯
 0  │        ╭╯
    │     ╭──╯
-1  │─────╯
     -6   -3    0    3    6
When to use: Hidden layers when you need zero-centered output. Better than sigmoid for hidden layers but still suffers from vanishing gradients.
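Tanh is in fact a rescaled, shifted sigmoid, which is why it shares the vanishing-gradient problem while fixing the centering. A quick numerical check:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)

# tanh is a rescaled sigmoid: tanh(x) = 2·σ(2x) − 1
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # → True

# Zero-centered: for symmetric inputs the outputs average to 0
print(np.isclose(np.tanh(x).mean(), 0.0))               # → True
```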
3. ReLU (Rectified Linear Unit)
ReLU(x) = max(0, x)
Output range: [0, ∞)
Shape: Ramp function
    │            ╱
    │          ╱
    │        ╱
 0  │━━━━━━━╱
     -3    0    3    6
When to use: Default choice for hidden layers in almost all modern networks.
Advantages:
- Computationally efficient (just a threshold)
- No vanishing gradient for positive values
- Induces sparsity (negative values become exactly zero)
Problem: Dying ReLU — Neurons that output zero for all inputs stop learning forever. Solutions: Leaky ReLU, ELU, or careful initialization.
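ReLU and its leaky variant differ only on the negative side, which is exactly where the dying-ReLU problem lives. A minimal sketch comparing the two and their gradients:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # negatives clamped to exactly 0 (sparsity)
print(leaky_relu(x))  # negatives keep a small slope

# Dying ReLU: the gradient is 0 everywhere the input is negative,
# so a neuron stuck on the negative side receives no learning signal.
relu_grad = (x > 0).astype(float)        # 0.0 for all x ≤ 0
leaky_grad = np.where(x > 0, 1.0, 0.01)  # never exactly 0
```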
4. Softmax
softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ)
Converts raw scores into probabilities that sum to 1.
Input: [2.0, 1.0, 0.1]
Output: [0.659, 0.242, 0.099] ← Sum = 1.0
When to use: Output layer for multi-class classification. Tells you the probability of each class.
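A straightforward implementation reproduces the example above. One practical detail worth knowing: subtracting the maximum score before exponentiating avoids overflow and leaves the result unchanged.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability (doesn't change the result)
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(np.round(probs, 3))          # → [0.659 0.242 0.099]
print(np.isclose(probs.sum(), 1))  # → True
```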
Choosing the Right Activation Function
RULE OF THUMB:
Hidden layers → ReLU (or Leaky ReLU for safety)
Binary output → Sigmoid
Multi-class output → Softmax
Value prediction → Linear (no activation) or Tanh
For Reinforcement Learning specifically:
- Q-value networks → Linear output (Q-values can be any real number)
- Policy networks → Softmax (discrete actions) or Tanh (continuous actions)
- Advantage function → Linear output
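These rules come together in a single forward pass. A hypothetical sketch of a tiny policy network for discrete actions (ReLU hidden layer, softmax output) — all shapes and names are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def forward_policy(state, W1, b1, W2, b2):
    hidden = np.maximum(0.0, W1 @ state + b1)  # hidden layer → ReLU
    logits = W2 @ hidden + b2                  # a Q-network would stop here (linear output)
    e = np.exp(logits - logits.max())
    return e / e.sum()                         # policy head → softmax over actions

state = rng.standard_normal(4)                           # 4-dim state, arbitrary
W1, b1 = rng.standard_normal((8, 4)), np.zeros(8)        # 8 hidden units
W2, b2 = rng.standard_normal((3, 8)), np.zeros(3)        # 3 discrete actions

action_probs = forward_policy(state, W1, b1, W2, b2)
print(action_probs)  # a valid probability distribution over 3 actions
```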
Variants of ReLU
| Variant | Formula | Advantage |
|---|---|---|
| Leaky ReLU | max(0.01x, x) | Small gradient for negatives, prevents dying |
| ELU | x if x>0, α(eˣ-1) if x≤0 | Smoother, zero-centered negatives |
| GELU | x·Φ(x) | Used in Transformers (GPT, BERT) |
| Swish | x·σ(x) | Self-gated, smooth, used in EfficientNet |
GELU is the activation function used in most modern Transformer architectures including GPT and BERT. It provides a smooth approximation to ReLU with probabilistic gating.
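All four variants are a few lines of NumPy each. Note that for GELU this sketch uses the widely used tanh approximation of x·Φ(x) rather than the exact normal CDF:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def gelu(x):
    # Common tanh approximation of x·Φ(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    return x / (1.0 + np.exp(-x))  # x·σ(x)

x = np.linspace(-3, 3, 7)
for f in (leaky_relu, elu, gelu, swish):
    print(f.__name__, np.round(f(x), 3))
```

All four agree with ReLU for large positive inputs; they differ only in how gently they handle the negative side.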
Action Step
When we build our first neural network in the next chapters, you will see these activation functions in action. The key takeaway: ReLU for hidden layers, Softmax/Sigmoid for outputs — this simple rule covers 90% of practical cases.