| # Comprehensive Tutorial: Activation Functions in Deep Learning | |
| ## Table of Contents | |
| 1. [Introduction](#introduction) | |
| 2. [Theoretical Background](#theoretical-background) | |
| 3. [Experiment 1: Gradient Flow](#experiment-1-gradient-flow) | |
| 4. [Experiment 2: Sparsity and Dead Neurons](#experiment-2-sparsity-and-dead-neurons) | |
| 5. [Experiment 3: Training Stability](#experiment-3-training-stability) | |
| 6. [Experiment 4: Representational Capacity](#experiment-4-representational-capacity) | |
| 7. [**Experiment 5: Temporal Gradient Analysis**](#experiment-5-temporal-gradient-analysis) *(NEW)* | |
| 8. [Summary and Recommendations](#summary-and-recommendations) | |
| --- | |
| ## Introduction | |
| Activation functions are a critical component of neural networks that introduce non-linearity, enabling networks to learn complex patterns. This tutorial provides both **theoretical explanations** and **empirical experiments** to understand how different activation functions affect: | |
| 1. **Gradient Flow**: Do gradients vanish or explode during backpropagation? | |
| 2. **Sparsity & Dead Neurons**: How easily do units turn on/off? | |
| 3. **Stability**: How robust is training under stress (large learning rates, deep networks)? | |
| 4. **Representational Capacity**: How well can the network approximate different functions? | |
| ### Activation Functions Studied | |
| | Function | Formula | Range | Key Property | | |
| |----------|---------|-------|--------------| | |
| | Linear | f(x) = x | (-β, β) | No non-linearity | | |
| | Sigmoid | f(x) = 1/(1+eβ»Λ£) | (0, 1) | Bounded, saturates | | |
| | Tanh | f(x) = (eΛ£-eβ»Λ£)/(eΛ£+eβ»Λ£) | (-1, 1) | Zero-centered, saturates | | |
| | ReLU | f(x) = max(0, x) | [0, β) | Sparse, can die | | |
| | Leaky ReLU | f(x) = max(Ξ±x, x) | (-β, β) | Prevents dead neurons | | |
| | ELU | f(x) = x if x>0, Ξ±(eΛ£-1) otherwise | (-Ξ±, β) | Smooth negative region | | |
| | GELU | f(x) = xΒ·Ξ¦(x) | β(-0.17, β) | Smooth, probabilistic | | |
| | Swish | f(x) = xΒ·Ο(x) | β(-0.28, β) | Self-gated | | |
| --- | |
| ## Theoretical Background | |
| ### Why Non-linearity Matters | |
| Without activation functions, a neural network of any depth is equivalent to a single linear transformation: | |
| ``` | |
| f(x) = Wβ Γ Wβββ Γ ... Γ Wβ Γ x = W_combined Γ x | |
| ``` | |
| Non-linear activations allow networks to approximate **any continuous function** (Universal Approximation Theorem). | |
| ### The Gradient Flow Problem | |
| During backpropagation, gradients flow through the chain rule: | |
| ``` | |
| βL/βWα΅’ = βL/βaβ Γ βaβ/βaβββ Γ ... Γ βaα΅’ββ/βaα΅’ Γ βaα΅’/βWα΅’ | |
| ``` | |
| Each layer contributes a factor of **Ο'(z) Γ W**, where Ο' is the activation derivative. | |
| **Vanishing Gradients**: When |Ο'(z)| < 1 repeatedly | |
| - Sigmoid: Ο'(z) β (0, 0.25], maximum at z=0 | |
| - For n layers: gradient β (0.25)βΏ β 0 as n β β | |
| **Exploding Gradients**: When |Ο'(z) Γ W| > 1 repeatedly | |
| - More common with unbounded activations | |
| - Mitigated by gradient clipping, proper initialization | |
| --- | |
| ## Experiment 1: Gradient Flow | |
| ### Question | |
| How do gradients propagate through deep networks with different activations? | |
| ### Method | |
| - Built networks with depths [5, 10, 20, 50] | |
| - Measured gradient magnitude at each layer during backpropagation | |
| - Used Xavier initialization for fair comparison | |
| ### Results | |
|  | |
| #### Gradient Ratio (Layer 10 / Layer 1) at Depth=20 | |
| | Activation | Gradient Ratio | Interpretation | | |
| |------------|----------------|----------------| | |
| | Linear | 1.43e+00 | Stable gradient flow | | |
| | Sigmoid | inf | Severe vanishing gradients | | |
| | Tanh | 5.07e-01 | Stable gradient flow | | |
| | ReLU | 1.08e+00 | Stable gradient flow | | |
| | LeakyReLU | 1.73e+00 | Stable gradient flow | | |
| | ELU | 8.78e-01 | Stable gradient flow | | |
| | GELU | 3.34e-01 | Stable gradient flow | | |
| | Swish | 1.14e+00 | Stable gradient flow | | |
| ### Theoretical Explanation | |
| **Sigmoid** shows the most severe gradient decay because: | |
| - Maximum derivative is only 0.25 (at z=0) | |
| - In deep networks: 0.25Β²β° β 10β»ΒΉΒ² (effectively zero!) | |
| **ReLU** maintains gradients better because: | |
| - Derivative is exactly 1 for positive inputs | |
| - But can be exactly 0 for negative inputs (dead neurons) | |
| **GELU/Swish** provide smooth gradient flow: | |
| - Derivatives are bounded but not as severely as Sigmoid | |
| - Smooth transitions prevent sudden gradient changes | |
| --- | |
| ## Experiment 2: Sparsity and Dead Neurons | |
| ### Question | |
| How do activations affect the sparsity of representations and the "death" of neurons? | |
| ### Method | |
| - Trained 10-layer networks with high learning rate (0.1) to stress-test | |
| - Measured activation sparsity (% of near-zero activations) | |
| - Measured dead neuron rate (neurons that never activate) | |
| ### Results | |
|  | |
| | Activation | Sparsity (%) | Dead Neurons (%) | | |
| |------------|--------------|------------------| | |
| | Linear | 0.0% | 100.0% | | |
| | Sigmoid | 8.2% | 8.2% | | |
| | Tanh | 0.0% | 0.0% | | |
| | ReLU | 48.8% | 6.6% | | |
| | LeakyReLU | 0.1% | 0.0% | | |
| | ELU | 0.0% | 0.0% | | |
| | GELU | 0.0% | 0.0% | | |
| | Swish | 0.0% | 0.0% | | |
| ### Theoretical Explanation | |
| **ReLU creates sparse representations**: | |
| - Any negative input β output is exactly 0 | |
| - ~50% sparsity is typical with zero-mean inputs | |
| - Sparsity can be beneficial (efficiency, regularization) | |
| **Dead Neuron Problem**: | |
| - If a ReLU neuron's input is always negative, it outputs 0 forever | |
| - Gradient is 0, so weights never update | |
| - Caused by: bad initialization, large learning rates, unlucky gradients | |
| **Solutions**: | |
| - **Leaky ReLU**: Small gradient (0.01) for negative inputs | |
| - **ELU**: Smooth negative region with non-zero gradient | |
| - **Proper initialization**: Keep activations in a good range | |
| --- | |
| ## Experiment 3: Training Stability | |
| ### Question | |
| How stable is training under stress conditions (large learning rates, deep networks)? | |
| ### Method | |
| - Tested learning rates: [0.001, 0.01, 0.1, 0.5, 1.0] | |
| - Tested depths: [5, 10, 20, 50, 100] | |
| - Measured whether training diverged (loss β β) | |
| ### Results | |
|  | |
| ### Key Observations | |
| **Learning Rate Stability**: | |
| - Sigmoid/Tanh: Most stable (bounded outputs prevent explosion) | |
| - ReLU: Can diverge at high learning rates | |
| - GELU/Swish: Good balance of stability and performance | |
| **Depth Stability**: | |
| - All activations struggle with depth > 50 without special techniques | |
| - Sigmoid fails earliest due to vanishing gradients | |
| - ReLU/LeakyReLU maintain trainability longer | |
| ### Theoretical Explanation | |
| **Why bounded activations are more stable**: | |
| - Sigmoid outputs β (0, 1), so activations can't explode | |
| - But gradients can vanish, making learning very slow | |
| **Why ReLU can be unstable**: | |
| - Unbounded outputs: large inputs β large outputs β larger gradients | |
| - Positive feedback loop can cause explosion | |
| **Modern solutions**: | |
| - Batch Normalization: Keeps activations in good range | |
| - Residual Connections: Allow gradients to bypass layers | |
| - Gradient Clipping: Prevents explosion | |
| --- | |
| ## Experiment 4: Representational Capacity | |
| ### Question | |
| How well can networks with different activations approximate various functions? | |
| ### Method | |
| - Target functions: sin(x), |x|, step, sin(10x), xΒ³ | |
| - 5-layer networks, 500 epochs training | |
| - Measured test MSE | |
| ### Results | |
|  | |
|  | |
| #### Test MSE by Activation Γ Target Function | |
| | Activation | sin(x) | |x| | step | sin(10x) | xΒ³ | | |
| |------------|------|------|------|------|------| | |
| | Linear | 0.0262 | 0.3347 | 0.0406 | 0.4906 | 1.4807 | | |
| | Sigmoid | 0.0015 | 0.0025 | 0.0007 | 0.4910 | 0.0184 | | |
| | Tanh | 0.0006 | 0.0022 | 0.0000 | 0.4903 | 0.0008 | | |
| | ReLU | 0.0000 | 0.0000 | 0.0000 | 0.0006 | 0.0002 | | |
| | LeakyReLU | 0.0000 | 0.0000 | 0.0000 | 0.0008 | 0.0004 | | |
| | ELU | 0.0007 | 0.0005 | 0.0012 | 0.2388 | 0.0003 | | |
| | GELU | 0.0000 | 0.0006 | 0.0001 | 0.0009 | 0.0033 | | |
| | Swish | 0.0000 | 0.0017 | 0.0004 | 0.4601 | 0.0016 | | |
| ### Theoretical Explanation | |
| **Universal Approximation Theorem**: | |
| - Any continuous function can be approximated with enough neurons | |
| - But different activations have different "inductive biases" | |
| **ReLU excels at piecewise functions** (like |x|): | |
| - ReLU networks compute piecewise linear functions | |
| - Perfect match for |x| which is piecewise linear | |
| **Smooth activations for smooth functions**: | |
| - GELU, Swish produce smoother decision boundaries | |
| - Better for smooth targets like sin(x) | |
| **High-frequency functions are hard**: | |
| - sin(10x) has 10 oscillations in [-2, 2] | |
| - Requires many neurons to capture all oscillations | |
| - All activations struggle without sufficient width | |
| --- | |
| ## Experiment 5: Temporal Gradient Analysis | |
| ### Question | |
| How do gradients evolve during training? Does the vanishing gradient problem persist or improve? | |
| ### Method | |
| - Measured gradient magnitudes at epochs 1, 100, and 200 | |
| - Tracked gradient ratio (Layer 10 / Layer 1) over time | |
| - Analyzed whether training helps or hurts gradient flow | |
| ### Results | |
|  | |
|  | |
| #### Gradient Magnitudes at Key Training Epochs | |
| | Activation | Epoch | Layer 1 | Layer 5 | Layer 10 | Ratio (L10/L1) | | |
| |------------|-------|---------|---------|----------|----------------| | |
| | Linear | 1 | 4.01e-04 | 3.29e-04 | 7.44e-04 | 1.86 | | |
| | Linear | 100 | 3.10e-05 | 2.78e-05 | 3.57e-05 | 1.15 | | |
| | Linear | 200 | 1.12e-07 | 9.99e-08 | 1.21e-07 | 1.08 | | |
| | **Sigmoid** | **1** | **1.66e-10** | **2.40e-07** | **3.68e-03** | **2.22e+07** | | |
| | **Sigmoid** | **100** | **1.04e-10** | **3.24e-10** | **4.77e-06** | **4.59e+04** | | |
| | **Sigmoid** | **200** | **1.32e-10** | **1.24e-10** | **3.23e-08** | **2.45e+02** | | |
| | ReLU | 1 | 1.20e-05 | 6.12e-06 | 3.23e-05 | 2.69 | | |
| | ReLU | 100 | 2.04e-03 | 1.28e-03 | 4.84e-04 | 0.24 | | |
| | ReLU | 200 | 1.27e-04 | 7.49e-05 | 1.91e-05 | 0.15 | | |
| | Leaky ReLU | 1 | 2.78e-06 | 5.04e-06 | 3.17e-04 | 114 | | |
| | Leaky ReLU | 100 | 1.30e-03 | 4.29e-04 | 3.37e-04 | 0.26 | | |
| | Leaky ReLU | 200 | 8.98e-04 | 8.29e-04 | 1.79e-04 | 0.20 | | |
| | GELU | 1 | 4.10e-07 | 7.02e-07 | 1.50e-04 | 365 | | |
| | GELU | 100 | 2.66e-04 | 1.54e-04 | 2.57e-04 | 0.97 | | |
| | GELU | 200 | 4.87e-04 | 2.21e-04 | 1.63e-04 | 0.34 | | |
| ### Key Insights | |
| #### 1. Sigmoid's Catastrophic Vanishing Gradients | |
| - **At epoch 1**: Gradient ratio is **22 million to 1** (Layer 10 vs Layer 1) | |
| - This means Layer 1 receives 22 million times less gradient signal than Layer 10 | |
| - The early layers essentially cannot learn! | |
| - Even after 200 epochs, the ratio is still 245:1 | |
| #### 2. Modern Activations Self-Correct | |
| - **ReLU, Leaky ReLU, GELU**: Start with some gradient imbalance | |
| - By epoch 100-200, ratios approach 0.2-1.0 (healthy range) | |
| - The network learns to balance gradient flow through weight adaptation | |
| #### 3. Training Dynamics Visualization | |
|  | |
| This comprehensive figure shows: | |
| - **Panel A**: Loss curves showing convergence speed | |
| - **Panel B**: Gradient ratio evolution over training | |
| - **Panel C**: Final learned functions | |
| - **Panels D1-D3**: Gradient flow at epochs 1, 100, 200 | |
| - **Panels E1-E3**: Function approximation at epochs 50, 200, 499 | |
| ### Theoretical Explanation | |
| **Why Sigmoid gradients don't improve**: | |
| - Sigmoid saturates to 0 or 1 for large inputs | |
| - Derivative Ο'(z) = Ο(z)(1-Ο(z)) β 0 when Ο(z) β 0 or 1 | |
| - Deep layers push activations toward saturation | |
| - Early layers are "locked" and cannot adapt | |
| **Why ReLU/GELU gradients stabilize**: | |
| - Adam optimizer adapts learning rates per-parameter | |
| - Weights adjust to keep activations in "active" region | |
| - Network finds a gradient-friendly configuration | |
| ### Practical Implications | |
| 1. **Sigmoid is fundamentally broken for deep hidden layers** | |
| - Not just slow to train, but mathematically unable to learn | |
| - Early layers receive ~10β»ΒΉβ° gradient magnitude | |
| 2. **Modern activations are self-healing** | |
| - Initial gradient imbalance corrects during training | |
| - Adam optimizer helps by adapting per-parameter learning rates | |
| 3. **Monitor gradient ratios during training** | |
| - Ratio > 100 indicates vanishing gradients | |
| - Ratio < 0.01 indicates exploding gradients | |
| - Healthy range: 0.1 to 10 | |
| --- | |
| ## Summary and Recommendations | |
| ### Comparison Table | |
| | Property | Best Activations | Worst Activations | | |
| |----------|------------------|-------------------| | |
| | Gradient Flow | LeakyReLU, GELU | Sigmoid, Tanh | | |
| | Avoids Dead Neurons | LeakyReLU, ELU, GELU | ReLU | | |
| | Training Stability | Sigmoid, Tanh, GELU | ReLU (high lr) | | |
| | Smooth Functions | GELU, Swish, Tanh | ReLU | | |
| | Sharp Functions | ReLU, LeakyReLU | Sigmoid | | |
| | Computational Speed | ReLU, LeakyReLU | GELU, Swish | | |
| ### Practical Recommendations | |
| 1. **Default Choice**: **ReLU** or **LeakyReLU** | |
| - Simple, fast, effective for most tasks | |
| - Use LeakyReLU if dead neurons are a concern | |
| 2. **For Transformers/Attention**: **GELU** | |
| - Standard in BERT, GPT, modern transformers | |
| - Smooth gradients help with optimization | |
| 3. **For Very Deep Networks**: **LeakyReLU** or **ELU** | |
| - Or use residual connections + batch normalization | |
| - Avoid Sigmoid/Tanh in hidden layers | |
| 4. **For Regression with Bounded Outputs**: **Sigmoid** (output layer only) | |
| - Use for probabilities or [0, 1] outputs | |
| - Never in hidden layers of deep networks | |
| 5. **For RNNs/LSTMs**: **Tanh** (traditional choice) | |
| - Zero-centered helps with recurrent dynamics | |
| - Modern alternative: use Transformers instead | |
| ### The Big Picture | |
| ``` | |
| ACTIVATION FUNCTION SELECTION GUIDE | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β Is it a hidden layer? β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β | |
| βββββββββββββββββ΄ββββββββββββββββ | |
| βΌ βΌ | |
| YES NO (output layer) | |
| β β | |
| βΌ βΌ | |
| βββββββββββββββββββ βββββββββββββββββββββββ | |
| β Is it a β β What's the task? β | |
| β Transformer? β β β | |
| βββββββββββββββββββ β Binary class β Sigmoid | |
| β β Multi-class β Softmax | |
| βββββββββ΄ββββββββ β Regression β Linear β | |
| βΌ βΌ βββββββββββββββββββββββ | |
| YES NO | |
| β β | |
| βΌ βΌ | |
| GELU βββββββββββββββββββ | |
| β Worried about β | |
| β dead neurons? β | |
| βββββββββββββββββββ | |
| β | |
| βββββββββ΄ββββββββ | |
| βΌ βΌ | |
| YES NO | |
| β β | |
| βΌ βΌ | |
| LeakyReLU ReLU | |
| or ELU | |
| ``` | |
| --- | |
| ## Files Generated | |
| | File | Description | | |
| |------|-------------| | |
| | learned_functions.png | Final learned functions vs ground truth | | |
| | loss_curves.png | Training loss curves over 500 epochs | | |
| | gradient_flow.png | Gradient magnitude across layers (epoch 1) | | |
| | gradient_flow_epochs.png | **NEW** Gradient flow at epochs 1, 100, 200 | | |
| | gradient_evolution.png | **NEW** Gradient ratio evolution over training | | |
| | hidden_activations.png | Activation distributions in trained network | | |
| | training_dynamics_functions.png | **NEW** Function learning over time | | |
| | activation_evolution.png | **NEW** Activation distribution evolution | | |
| | training_dynamics_summary.png | **NEW** Comprehensive training dynamics | | |
| | exp1_gradient_flow.png | Gradient magnitude across layers | | |
| | exp2_sparsity_dead_neurons.png | Sparsity and dead neuron rates | | |
| | exp2_activation_distributions.png | Activation value distributions | | |
| | exp3_stability.png | Stability vs learning rate and depth | | |
| | exp4_representational_heatmap.png | MSE heatmap for different targets | | |
| | exp4_predictions.png | Actual predictions vs ground truth | | |
| | summary_figure.png | Comprehensive summary visualization | | |
| --- | |
| ## References | |
| 1. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. | |
| 2. He, K., et al. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. | |
| 3. Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). | |
| 4. Ramachandran, P., et al. (2017). Searching for Activation Functions. | |
| 5. Nwankpa, C., et al. (2018). Activation Functions: Comparison of trends in Practice and Research for Deep Learning. | |
| --- | |
| *Tutorial generated by Orchestra Research Assistant* | |
| *All experiments are reproducible with the provided code* | |