Chapter 1 AI Foundations — Neural Networks & Deep Learning Essentials

Activation Functions — ReLU, Sigmoid, Tanh & Softmax Explained

18 min read · Lesson 2 / 50

The Non-Linear Magic

Without activation functions, a neural network is just a linear function — no matter how many layers you stack, the output remains a linear combination of inputs. Activation functions introduce non-linearity, enabling networks to learn complex, curved decision boundaries.

Why Non-Linearity Matters

WITHOUT activation functions:
Layer 1: y₁ = W₁x + b₁
Layer 2: y₂ = W₂y₁ + b₂ = W₂(W₁x + b₁) + b₂ = W₂W₁x + W₂b₁ + b₂
→ Still a linear function! Two layers = one layer. Useless depth.

WITH activation functions:
Layer 1: y₁ = f(W₁x + b₁)
Layer 2: y₂ = f(W₂y₁ + b₂)
→ Non-linear composition! Can approximate any continuous function on a compact domain (the universal approximation theorem).
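The collapse of stacked linear layers can be checked numerically. This is a minimal sketch; the layer sizes and random seed are arbitrary, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" of weights and biases (hypothetical sizes: 3 → 4 → 2).
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

# Two stacked linear layers...
two_layer = W2 @ (W1 @ x + b1) + b2
# ...collapse into one equivalent layer: W = W2·W1, b = W2·b1 + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
assert np.allclose(two_layer, one_layer)  # depth added nothing

# With a non-linearity (here ReLU) between layers, no such collapse exists.
relu = lambda z: np.maximum(0.0, z)
nonlinear = W2 @ relu(W1 @ x + b1) + b2
```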

The Activation Functions You Must Know

1. Sigmoid (σ)

σ(x) = 1 / (1 + e⁻ˣ)

Output range: (0, 1)
Shape: S-curve

     1 ─────────────────────
       │              ╱╱
  0.5 ─│────────────╱╱────
       │          ╱╱
     0 ─╱╱─────────────────
      -6  -3    0    3    6

When to use: Output layer for binary classification (probability of class 1).

Problems:

  • Vanishing gradient — For very large or very small inputs, the gradient approaches zero, making learning extremely slow in deep networks
  • Not zero-centered — Outputs are always positive, which can slow gradient-based optimization
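The vanishing-gradient problem is easy to see from the derivative, σ′(x) = σ(x)(1 − σ(x)), which peaks at 0.25 and decays toward zero on both sides. A small sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)  # derivative of sigmoid; maximum 0.25 at x = 0

print(sigmoid_grad(0.0))  # 0.25 (maximum possible gradient)
print(sigmoid_grad(6.0))  # ~0.0025 — the learning signal is nearly gone
```

In a deep network these small factors multiply layer after layer, which is why early layers can stop learning.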

2. Tanh (Hyperbolic Tangent)

tanh(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)

Output range: (-1, 1)
Shape: S-curve centered at zero

     1 ─────────────────────
       │              ╱╱
     0 ─│────────────╱╱────
       │          ╱╱
    -1 ─╱╱─────────────────
      -6  -3    0    3    6

When to use: Hidden layers when you need zero-centered output. Better than sigmoid for hidden layers but still suffers from vanishing gradients.
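The zero-centering claim can be verified directly: tanh is just a rescaled sigmoid, and over symmetric inputs its activations average to zero while sigmoid's average to 0.5. A quick check:

```python
import numpy as np

# tanh is a scaled, shifted sigmoid: tanh(x) = 2·sigmoid(2x) − 1
x = np.linspace(-3, 3, 7)
assert np.allclose(np.tanh(x), 2 / (1 + np.exp(-2 * x)) - 1)

# Zero-centered: symmetric inputs give mean-zero activations,
# unlike sigmoid, whose outputs are always positive.
print(np.tanh(x).mean())              # 0.0
print((1 / (1 + np.exp(-x))).mean())  # 0.5
```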

3. ReLU (Rectified Linear Unit)

ReLU(x) = max(0, x)

Output range: [0, ∞)
Shape: Ramp function

       │           ╱
       │         ╱
       │       ╱
     0 ─━━━━━╱──────────
      -6  -3 0    3    6

When to use: Default choice for hidden layers in almost all modern networks.

Advantages:

  • Computationally efficient (just a threshold)
  • No vanishing gradient for positive values
  • Induces sparsity (negative values become exactly zero)

Problem: Dying ReLU — a neuron whose pre-activation is negative for every input always outputs zero, and since ReLU's gradient is also zero there, the neuron stops updating permanently. Solutions: Leaky ReLU, ELU, or careful initialization.
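Both ReLU and its leaky variant are one-liners; the sketch below shows how the small negative slope keeps dead neurons from getting stuck (0.01 is the conventional default slope):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for x < 0 keeps a gradient flowing,
    # preventing neurons from dying.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.  0.  0.  1.5] — negatives zeroed out (sparsity)
print(leaky_relu(x))  # [-0.02  -0.005  0.  1.5]
```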

4. Softmax

softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)

Converts raw scores into probabilities that sum to 1.

Input:  [2.0, 1.0, 0.1]
Output: [0.659, 0.242, 0.099]  ← Sum = 1.0

When to use: Output layer for multi-class classification. Tells you the probability of each class.
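In practice, softmax is implemented with a max-subtraction trick to avoid numerical overflow; this does not change the result because softmax is shift-invariant. A sketch reproducing the example above:

```python
import numpy as np

def softmax(x):
    # Subtracting the max before exponentiating avoids overflow
    # for large scores; softmax(x) == softmax(x - c) for any constant c.
    z = np.exp(x - np.max(x))
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs.round(3))  # [0.659 0.242 0.099]
assert np.isclose(probs.sum(), 1.0)
```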

Choosing the Right Activation Function

RULE OF THUMB:

Hidden layers → ReLU (or Leaky ReLU for safety)
Binary output → Sigmoid
Multi-class output → Softmax
Value prediction → Linear (no activation) or Tanh

For Reinforcement Learning specifically:
- Q-value networks → Linear output (Q-values can be any real number)
- Policy networks → Softmax (discrete actions) or Tanh (continuous actions)
- Advantage function → Linear output
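The RL rules above can be sketched as three output heads on the same hidden features. This is a hypothetical toy example (random weights, no training, sizes invented for illustration), just to show which activation sits on each head:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = np.maximum(0.0, rng.standard_normal(8))  # ReLU hidden activations

# Q-value head: linear output — Q-values can be any real number.
w_q = rng.standard_normal(8)
q_value = hidden @ w_q

# Policy head (4 discrete actions): softmax gives a probability distribution.
w_pi = rng.standard_normal((4, 8))
logits = w_pi @ hidden
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Continuous-action head: tanh bounds each action component to (-1, 1).
w_a = rng.standard_normal((2, 8))
action = np.tanh(w_a @ hidden)
```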

Variants of ReLU

Variant      Formula                      Advantage
Leaky ReLU   max(0.01x, x)                Small gradient for negatives, prevents dying
ELU          x if x>0, α(eˣ−1) if x≤0     Smoother, zero-centered negatives
GELU         x·Φ(x)                       Used in Transformers (GPT, BERT)
Swish        x·σ(x)                       Self-gated, smooth, used in EfficientNet

GELU is the activation function used in most modern Transformer architectures including GPT and BERT. It provides a smooth approximation to ReLU with probabilistic gating.
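GELU and Swish are simple to write down. The GELU below uses the common tanh approximation of x·Φ(x) (the form many Transformer implementations use); treat it as a sketch rather than the one canonical definition:

```python
import numpy as np

def gelu(x):
    # Tanh approximation of x·Φ(x); 0.044715 is the standard constant.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    return x / (1 + np.exp(-x))  # x·σ(x)

# Both act like ReLU for large positive x and fade smoothly to 0
# for negative x, instead of cutting off hard at zero.
x = np.array([-3.0, 0.0, 3.0])
print(gelu(x))
print(swish(x))
```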

Action Step

When we build our first neural network in the next chapters, you will see these activation functions in action. The key takeaway: ReLU for hidden layers, Softmax/Sigmoid for outputs — this simple rule covers 90% of practical cases.
