activation_functions / activation_tutorial.md

Upload activation_tutorial.md with huggingface_hub

3283ee8 verified 11 days ago

17.4 kB

	# Comprehensive Tutorial: Activation Functions in Deep Learning

	## Table of Contents
	1. [Introduction](#introduction)
	2. [Theoretical Background](#theoretical-background)
	3. [Experiment 1: Gradient Flow](#experiment-1-gradient-flow)
	4. [Experiment 2: Sparsity and Dead Neurons](#experiment-2-sparsity-and-dead-neurons)
	5. [Experiment 3: Training Stability](#experiment-3-training-stability)
	6. [Experiment 4: Representational Capacity](#experiment-4-representational-capacity)
	7. [Experiment 5: Temporal Gradient Analysis](#experiment-5-temporal-gradient-analysis) (NEW)
	8. [Summary and Recommendations](#summary-and-recommendations)

	---

	## Introduction

	Activation functions are a critical component of neural networks that introduce non-linearity, enabling networks to learn complex patterns. This tutorial provides both theoretical explanations and empirical experiments to understand how different activation functions affect:

	1. Gradient Flow: Do gradients vanish or explode during backpropagation?
	2. Sparsity & Dead Neurons: How easily do units turn on/off?
	3. Stability: How robust is training under stress (large learning rates, deep networks)?
	4. Representational Capacity: How well can the network approximate different functions?

	### Activation Functions Studied

	\| Function \| Formula \| Range \| Key Property \|
	\|----------\|---------\|-------\|--------------\|
	\| Linear \| f(x) = x \| (-∞, ∞) \| No non-linearity \|
	\| Sigmoid \| f(x) = 1/(1+e⁻ˣ) \| (0, 1) \| Bounded, saturates \|
	\| Tanh \| f(x) = (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) \| (-1, 1) \| Zero-centered, saturates \|
	\| ReLU \| f(x) = max(0, x) \| [0, ∞) \| Sparse, can die \|
	\| Leaky ReLU \| f(x) = max(αx, x) \| (-∞, ∞) \| Prevents dead neurons \|
	\| ELU \| f(x) = x if x>0, α(eˣ-1) otherwise \| (-α, ∞) \| Smooth negative region \|
	\| GELU \| f(x) = x·Φ(x) \| ≈(-0.17, ∞) \| Smooth, probabilistic \|
	\| Swish \| f(x) = x·σ(x) \| ≈(-0.28, ∞) \| Self-gated \|

	---

	## Theoretical Background

	### Why Non-linearity Matters

	Without activation functions, a neural network of any depth is equivalent to a single linear transformation:

	```
	f(x) = Wₙ × Wₙ₋₁ × ... × W₁ × x = W_combined × x
	```

	Non-linear activations allow networks to approximate any continuous function (Universal Approximation Theorem).

	### The Gradient Flow Problem

	During backpropagation, gradients flow through the chain rule:

	```
	∂L/∂Wᵢ = ∂L/∂aₙ × ∂aₙ/∂aₙ₋₁ × ... × ∂aᵢ₊₁/∂aᵢ × ∂aᵢ/∂Wᵢ
	```

	Each layer contributes a factor of σ'(z) × W, where σ' is the activation derivative.

	Vanishing Gradients: When \|σ'(z)\| < 1 repeatedly
	- Sigmoid: σ'(z) ∈ (0, 0.25], maximum at z=0
	- For n layers: gradient ≈ (0.25)ⁿ → 0 as n → ∞

	Exploding Gradients: When \|σ'(z) × W\| > 1 repeatedly
	- More common with unbounded activations
	- Mitigated by gradient clipping, proper initialization

	---

	## Experiment 1: Gradient Flow

	### Question
	How do gradients propagate through deep networks with different activations?

	### Method
	- Built networks with depths [5, 10, 20, 50]
	- Measured gradient magnitude at each layer during backpropagation
	- Used Xavier initialization for fair comparison

	### Results

	![Gradient Flow](exp1_gradient_flow.png)

	#### Gradient Ratio (Layer 10 / Layer 1) at Depth=20

	\| Activation \| Gradient Ratio \| Interpretation \|
	\|------------\|----------------\|----------------\|
	\| Linear \| 1.43e+00 \| Stable gradient flow \|
	\| Sigmoid \| inf \| Severe vanishing gradients \|
	\| Tanh \| 5.07e-01 \| Stable gradient flow \|
	\| ReLU \| 1.08e+00 \| Stable gradient flow \|
	\| LeakyReLU \| 1.73e+00 \| Stable gradient flow \|
	\| ELU \| 8.78e-01 \| Stable gradient flow \|
	\| GELU \| 3.34e-01 \| Stable gradient flow \|
	\| Swish \| 1.14e+00 \| Stable gradient flow \|

	### Theoretical Explanation

	Sigmoid shows the most severe gradient decay because:
	- Maximum derivative is only 0.25 (at z=0)
	- In deep networks: 0.25²⁰ ≈ 10⁻¹² (effectively zero!)

	ReLU maintains gradients better because:
	- Derivative is exactly 1 for positive inputs
	- But can be exactly 0 for negative inputs (dead neurons)

	GELU/Swish provide smooth gradient flow:
	- Derivatives are bounded but not as severely as Sigmoid
	- Smooth transitions prevent sudden gradient changes

	---

	## Experiment 2: Sparsity and Dead Neurons

	### Question
	How do activations affect the sparsity of representations and the "death" of neurons?

	### Method
	- Trained 10-layer networks with high learning rate (0.1) to stress-test
	- Measured activation sparsity (% of near-zero activations)
	- Measured dead neuron rate (neurons that never activate)

	### Results

	![Sparsity and Dead Neurons](exp2_sparsity_dead_neurons.png)

	\| Activation \| Sparsity (%) \| Dead Neurons (%) \|
	\|------------\|--------------\|------------------\|
	\| Linear \| 0.0% \| 100.0% \|
	\| Sigmoid \| 8.2% \| 8.2% \|
	\| Tanh \| 0.0% \| 0.0% \|
	\| ReLU \| 48.8% \| 6.6% \|
	\| LeakyReLU \| 0.1% \| 0.0% \|
	\| ELU \| 0.0% \| 0.0% \|
	\| GELU \| 0.0% \| 0.0% \|
	\| Swish \| 0.0% \| 0.0% \|

	### Theoretical Explanation

	ReLU creates sparse representations:
	- Any negative input → output is exactly 0
	- ~50% sparsity is typical with zero-mean inputs
	- Sparsity can be beneficial (efficiency, regularization)

	Dead Neuron Problem:
	- If a ReLU neuron's input is always negative, it outputs 0 forever
	- Gradient is 0, so weights never update
	- Caused by: bad initialization, large learning rates, unlucky gradients

	Solutions:
	- Leaky ReLU: Small gradient (0.01) for negative inputs
	- ELU: Smooth negative region with non-zero gradient
	- Proper initialization: Keep activations in a good range

	---

	## Experiment 3: Training Stability

	### Question
	How stable is training under stress conditions (large learning rates, deep networks)?

	### Method
	- Tested learning rates: [0.001, 0.01, 0.1, 0.5, 1.0]
	- Tested depths: [5, 10, 20, 50, 100]
	- Measured whether training diverged (loss → ∞)

	### Results

	![Stability](exp3_stability.png)

	### Key Observations

	Learning Rate Stability:
	- Sigmoid/Tanh: Most stable (bounded outputs prevent explosion)
	- ReLU: Can diverge at high learning rates
	- GELU/Swish: Good balance of stability and performance

	Depth Stability:
	- All activations struggle with depth > 50 without special techniques
	- Sigmoid fails earliest due to vanishing gradients
	- ReLU/LeakyReLU maintain trainability longer

	### Theoretical Explanation

	Why bounded activations are more stable:
	- Sigmoid outputs ∈ (0, 1), so activations can't explode
	- But gradients can vanish, making learning very slow

	Why ReLU can be unstable:
	- Unbounded outputs: large inputs → large outputs → larger gradients
	- Positive feedback loop can cause explosion

	Modern solutions:
	- Batch Normalization: Keeps activations in good range
	- Residual Connections: Allow gradients to bypass layers
	- Gradient Clipping: Prevents explosion

	---

	## Experiment 4: Representational Capacity

	### Question
	How well can networks with different activations approximate various functions?

	### Method
	- Target functions: sin(x), \|x\|, step, sin(10x), x³
	- 5-layer networks, 500 epochs training
	- Measured test MSE

	### Results

	![Representational Capacity](exp4_representational_heatmap.png)

	![Predictions](exp4_predictions.png)

	#### Test MSE by Activation × Target Function

	\| Activation \| sin(x) \| \|x\| \| step \| sin(10x) \| x³ \|
	\|------------\|------\|------\|------\|------\|------\|
	\| Linear \| 0.0262 \| 0.3347 \| 0.0406 \| 0.4906 \| 1.4807 \|
	\| Sigmoid \| 0.0015 \| 0.0025 \| 0.0007 \| 0.4910 \| 0.0184 \|
	\| Tanh \| 0.0006 \| 0.0022 \| 0.0000 \| 0.4903 \| 0.0008 \|
	\| ReLU \| 0.0000 \| 0.0000 \| 0.0000 \| 0.0006 \| 0.0002 \|
	\| LeakyReLU \| 0.0000 \| 0.0000 \| 0.0000 \| 0.0008 \| 0.0004 \|
	\| ELU \| 0.0007 \| 0.0005 \| 0.0012 \| 0.2388 \| 0.0003 \|
	\| GELU \| 0.0000 \| 0.0006 \| 0.0001 \| 0.0009 \| 0.0033 \|
	\| Swish \| 0.0000 \| 0.0017 \| 0.0004 \| 0.4601 \| 0.0016 \|

	### Theoretical Explanation

	Universal Approximation Theorem:
	- Any continuous function can be approximated with enough neurons
	- But different activations have different "inductive biases"

	ReLU excels at piecewise functions (like \|x\|):
	- ReLU networks compute piecewise linear functions
	- Perfect match for \|x\| which is piecewise linear

	Smooth activations for smooth functions:
	- GELU, Swish produce smoother decision boundaries
	- Better for smooth targets like sin(x)

	High-frequency functions are hard:
	- sin(10x) has 10 oscillations in [-2, 2]
	- Requires many neurons to capture all oscillations
	- All activations struggle without sufficient width

	---

	## Experiment 5: Temporal Gradient Analysis

	### Question
	How do gradients evolve during training? Does the vanishing gradient problem persist or improve?

	### Method
	- Measured gradient magnitudes at epochs 1, 100, and 200
	- Tracked gradient ratio (Layer 10 / Layer 1) over time
	- Analyzed whether training helps or hurts gradient flow

	### Results

	![Gradient Flow Over Epochs](gradient_flow_epochs.png)

	![Gradient Evolution](gradient_evolution.png)

	#### Gradient Magnitudes at Key Training Epochs

	\| Activation \| Epoch \| Layer 1 \| Layer 5 \| Layer 10 \| Ratio (L10/L1) \|
	\|------------\|-------\|---------\|---------\|----------\|----------------\|
	\| Linear \| 1 \| 4.01e-04 \| 3.29e-04 \| 7.44e-04 \| 1.86 \|
	\| Linear \| 100 \| 3.10e-05 \| 2.78e-05 \| 3.57e-05 \| 1.15 \|
	\| Linear \| 200 \| 1.12e-07 \| 9.99e-08 \| 1.21e-07 \| 1.08 \|
	\| Sigmoid \| 1 \| 1.66e-10 \| 2.40e-07 \| 3.68e-03 \| 2.22e+07 \|
	\| Sigmoid \| 100 \| 1.04e-10 \| 3.24e-10 \| 4.77e-06 \| 4.59e+04 \|
	\| Sigmoid \| 200 \| 1.32e-10 \| 1.24e-10 \| 3.23e-08 \| 2.45e+02 \|
	\| ReLU \| 1 \| 1.20e-05 \| 6.12e-06 \| 3.23e-05 \| 2.69 \|
	\| ReLU \| 100 \| 2.04e-03 \| 1.28e-03 \| 4.84e-04 \| 0.24 \|
	\| ReLU \| 200 \| 1.27e-04 \| 7.49e-05 \| 1.91e-05 \| 0.15 \|
	\| Leaky ReLU \| 1 \| 2.78e-06 \| 5.04e-06 \| 3.17e-04 \| 114 \|
	\| Leaky ReLU \| 100 \| 1.30e-03 \| 4.29e-04 \| 3.37e-04 \| 0.26 \|
	\| Leaky ReLU \| 200 \| 8.98e-04 \| 8.29e-04 \| 1.79e-04 \| 0.20 \|
	\| GELU \| 1 \| 4.10e-07 \| 7.02e-07 \| 1.50e-04 \| 365 \|
	\| GELU \| 100 \| 2.66e-04 \| 1.54e-04 \| 2.57e-04 \| 0.97 \|
	\| GELU \| 200 \| 4.87e-04 \| 2.21e-04 \| 1.63e-04 \| 0.34 \|

	### Key Insights

	#### 1. Sigmoid's Catastrophic Vanishing Gradients
	- At epoch 1: Gradient ratio is 22 million to 1 (Layer 10 vs Layer 1)
	- This means Layer 1 receives 22 million times less gradient signal than Layer 10
	- The early layers essentially cannot learn!
	- Even after 200 epochs, the ratio is still 245:1

	#### 2. Modern Activations Self-Correct
	- ReLU, Leaky ReLU, GELU: Start with some gradient imbalance
	- By epoch 100-200, ratios approach 0.2-1.0 (healthy range)
	- The network learns to balance gradient flow through weight adaptation

	#### 3. Training Dynamics Visualization

	![Training Dynamics Summary](training_dynamics_summary.png)

	This comprehensive figure shows:
	- Panel A: Loss curves showing convergence speed
	- Panel B: Gradient ratio evolution over training
	- Panel C: Final learned functions
	- Panels D1-D3: Gradient flow at epochs 1, 100, 200
	- Panels E1-E3: Function approximation at epochs 50, 200, 499

	### Theoretical Explanation

	Why Sigmoid gradients don't improve:
	- Sigmoid saturates to 0 or 1 for large inputs
	- Derivative σ'(z) = σ(z)(1-σ(z)) → 0 when σ(z) → 0 or 1
	- Deep layers push activations toward saturation
	- Early layers are "locked" and cannot adapt

	Why ReLU/GELU gradients stabilize:
	- Adam optimizer adapts learning rates per-parameter
	- Weights adjust to keep activations in "active" region
	- Network finds a gradient-friendly configuration

	### Practical Implications

	1. Sigmoid is fundamentally broken for deep hidden layers
	- Not just slow to train, but mathematically unable to learn
	- Early layers receive ~10⁻¹⁰ gradient magnitude

	2. Modern activations are self-healing
	- Initial gradient imbalance corrects during training
	- Adam optimizer helps by adapting per-parameter learning rates

	3. Monitor gradient ratios during training
	- Ratio > 100 indicates vanishing gradients
	- Ratio < 0.01 indicates exploding gradients
	- Healthy range: 0.1 to 10

	---

	## Summary and Recommendations

	### Comparison Table

	\| Property \| Best Activations \| Worst Activations \|
	\|----------\|------------------\|-------------------\|
	\| Gradient Flow \| LeakyReLU, GELU \| Sigmoid, Tanh \|
	\| Avoids Dead Neurons \| LeakyReLU, ELU, GELU \| ReLU \|
	\| Training Stability \| Sigmoid, Tanh, GELU \| ReLU (high lr) \|
	\| Smooth Functions \| GELU, Swish, Tanh \| ReLU \|
	\| Sharp Functions \| ReLU, LeakyReLU \| Sigmoid \|
	\| Computational Speed \| ReLU, LeakyReLU \| GELU, Swish \|

	### Practical Recommendations

	1. Default Choice: ReLU or LeakyReLU
	- Simple, fast, effective for most tasks
	- Use LeakyReLU if dead neurons are a concern

	2. For Transformers/Attention: GELU
	- Standard in BERT, GPT, modern transformers
	- Smooth gradients help with optimization

	3. For Very Deep Networks: LeakyReLU or ELU
	- Or use residual connections + batch normalization
	- Avoid Sigmoid/Tanh in hidden layers

	4. For Regression with Bounded Outputs: Sigmoid (output layer only)
	- Use for probabilities or [0, 1] outputs
	- Never in hidden layers of deep networks

	5. For RNNs/LSTMs: Tanh (traditional choice)
	- Zero-centered helps with recurrent dynamics
	- Modern alternative: use Transformers instead

	### The Big Picture

	```
	ACTIVATION FUNCTION SELECTION GUIDE

	┌─────────────────────────────────────────────────────────────┐
	│ Is it a hidden layer? │
	└─────────────────────────────────────────────────────────────┘
	│
	┌───────────────┴───────────────┐
	▼ ▼
	YES NO (output layer)
	│ │
	▼ ▼
	┌─────────────────┐ ┌─────────────────────┐
	│ Is it a │ │ What's the task? │
	│ Transformer? │ │ │
	└─────────────────┘ │ Binary class → Sigmoid
	│ │ Multi-class → Softmax
	┌───────┴───────┐ │ Regression → Linear │
	▼ ▼ └─────────────────────┘
	YES NO
	│ │
	▼ ▼
	GELU ┌─────────────────┐
	│ Worried about │
	│ dead neurons? │
	└─────────────────┘
	│
	┌───────┴───────┐
	▼ ▼
	YES NO
	│ │
	▼ ▼
	LeakyReLU ReLU
	or ELU
	```

	---

	## Files Generated

	\| File \| Description \|
	\|------\|-------------\|
	\| learned_functions.png \| Final learned functions vs ground truth \|
	\| loss_curves.png \| Training loss curves over 500 epochs \|
	\| gradient_flow.png \| Gradient magnitude across layers (epoch 1) \|
	\| gradient_flow_epochs.png \| NEW Gradient flow at epochs 1, 100, 200 \|
	\| gradient_evolution.png \| NEW Gradient ratio evolution over training \|
	\| hidden_activations.png \| Activation distributions in trained network \|
	\| training_dynamics_functions.png \| NEW Function learning over time \|
	\| activation_evolution.png \| NEW Activation distribution evolution \|
	\| training_dynamics_summary.png \| NEW Comprehensive training dynamics \|
	\| exp1_gradient_flow.png \| Gradient magnitude across layers \|
	\| exp2_sparsity_dead_neurons.png \| Sparsity and dead neuron rates \|
	\| exp2_activation_distributions.png \| Activation value distributions \|
	\| exp3_stability.png \| Stability vs learning rate and depth \|
	\| exp4_representational_heatmap.png \| MSE heatmap for different targets \|
	\| exp4_predictions.png \| Actual predictions vs ground truth \|
	\| summary_figure.png \| Comprehensive summary visualization \|

	---

	## References

	1. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
	2. He, K., et al. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.
	3. Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs).
	4. Ramachandran, P., et al. (2017). Searching for Activation Functions.
	5. Nwankpa, C., et al. (2018). Activation Functions: Comparison of trends in Practice and Research for Deep Learning.

	---

	Tutorial generated by Orchestra Research Assistant
	All experiments are reproducible with the provided code