# Comprehensive Tutorial: Activation Functions in Deep Learning

## Table of Contents
1. [Introduction](#introduction)
2. [Theoretical Background](#theoretical-background)
3. [Experiment 1: Gradient Flow](#experiment-1-gradient-flow)
4. [Experiment 2: Sparsity and Dead Neurons](#experiment-2-sparsity-and-dead-neurons)
5. [Experiment 3: Training Stability](#experiment-3-training-stability)
6. [Experiment 4: Representational Capacity](#experiment-4-representational-capacity)
7. [**Experiment 5: Temporal Gradient Analysis**](#experiment-5-temporal-gradient-analysis) *(NEW)*
8. [Summary and Recommendations](#summary-and-recommendations)

---

## Introduction

Activation functions are a critical component of neural networks: they introduce non-linearity, enabling networks to learn complex patterns. This tutorial provides both **theoretical explanations** and **empirical experiments** to understand how different activation functions affect:

1. **Gradient Flow**: Do gradients vanish or explode during backpropagation?
2. **Sparsity & Dead Neurons**: How easily do units turn on/off?
3. **Stability**: How robust is training under stress (large learning rates, deep networks)?
4. **Representational Capacity**: How well can the network approximate different functions?

### Activation Functions Studied

| Function | Formula | Range | Key Property |
|----------|---------|-------|--------------|
| Linear | f(x) = x | (-∞, ∞) | No non-linearity |
| Sigmoid | f(x) = 1/(1+e⁻ˣ) | (0, 1) | Bounded, saturates |
| Tanh | f(x) = (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | (-1, 1) | Zero-centered, saturates |
| ReLU | f(x) = max(0, x) | [0, ∞) | Sparse, can die |
| Leaky ReLU | f(x) = max(αx, x) | (-∞, ∞) | Prevents dead neurons |
| ELU | f(x) = x if x>0, α(eˣ-1) otherwise | (-α, ∞) | Smooth negative region |
| GELU | f(x) = x·Φ(x) | ≈(-0.17, ∞) | Smooth, probabilistic |
| Swish | f(x) = x·σ(x) | ≈(-0.28, ∞) | Self-gated |

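All of these are available in standard deep learning frameworks. Below is a minimal PyTorch sketch (not the experiment code; the `negative_slope` and `alpha` values are common defaults and may differ from those used in the experiments):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4.0, 4.0, steps=9)

# Map each activation name from the table to a callable on a tensor.
activations = {
    "Linear":     lambda z: z,
    "Sigmoid":    torch.sigmoid,
    "Tanh":       torch.tanh,
    "ReLU":       F.relu,
    "Leaky ReLU": lambda z: F.leaky_relu(z, negative_slope=0.01),
    "ELU":        lambda z: F.elu(z, alpha=1.0),
    "GELU":       F.gelu,
    "Swish":      F.silu,  # Swish with beta = 1 is the same function as SiLU
}

for name, fn in activations.items():
    values = [round(v, 2) for v in fn(x).tolist()]
    print(f"{name:>10}: {values}")
```
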
---

## Theoretical Background

### Why Non-linearity Matters

Without activation functions, a neural network of any depth is equivalent to a single linear transformation:

```
f(x) = Wₙ × Wₙ₋₁ × ... × W₁ × x = W_combined × x
```
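
A quick numerical check of this collapse (a sketch, not part of the original experiments):

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(16, 8)    # first "layer"
W2 = torch.randn(4, 16)    # second "layer"
x = torch.randn(8)

deep = W2 @ (W1 @ x)          # two stacked linear layers, no activation
collapsed = (W2 @ W1) @ x     # a single combined linear map
print(torch.allclose(deep, collapsed, atol=1e-5))  # True
```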

Non-linear activations allow networks to approximate **any continuous function** (Universal Approximation Theorem).

### The Gradient Flow Problem

During backpropagation, gradients flow through the chain rule:

```
∂L/∂Wᵢ = ∂L/∂aₙ × ∂aₙ/∂aₙ₋₁ × ... × ∂aᵢ₊₁/∂aᵢ × ∂aᵢ/∂Wᵢ
```

Each layer contributes a factor of **σ'(z) × W**, where σ' is the activation derivative.

**Vanishing Gradients**: When |σ'(z)| < 1 repeatedly
- Sigmoid: σ'(z) ∈ (0, 0.25], maximum at z=0
- For n layers: gradient ≈ (0.25)ⁿ → 0 as n → ∞

**Exploding Gradients**: When |σ'(z) × W| > 1 repeatedly
- More common with unbounded activations
- Mitigated by gradient clipping, proper initialization
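
The compounding effect is easy to see numerically. A plain-Python sketch using the worst-case sigmoid bound σ'(z) ≤ 0.25 (and ignoring the weight factors):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z: float) -> float:
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z = 0

for depth in (5, 10, 20, 50):
    # Best case for sigmoid: every pre-activation sits at z = 0, so each factor is 0.25.
    factor = sigmoid_deriv(0.0) ** depth
    print(f"depth {depth:2d}: gradient factor <= {factor:.2e}")
# depth 20: factor <= 9.09e-13 -- effectively zero for the earliest layers
```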

---

## Experiment 1: Gradient Flow

### Question
How do gradients propagate through deep networks with different activations?

### Method
- Built networks with depths [5, 10, 20, 50]
- Measured gradient magnitude at each layer during backpropagation
- Used Xavier initialization for fair comparison

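The experiment script itself is not reproduced in this tutorial; the sketch below shows one way such a measurement can be done in PyTorch (depth, width, and the dummy loss are illustrative choices):

```python
import torch
import torch.nn as nn

def build_mlp(depth: int, width: int, act: nn.Module) -> nn.Sequential:
    layers = []
    for _ in range(depth):
        linear = nn.Linear(width, width)
        nn.init.xavier_uniform_(linear.weight)  # Xavier initialization, as in the experiment
        layers += [linear, act]
    return nn.Sequential(*layers)

def layer_gradient_norms(model: nn.Sequential, x: torch.Tensor) -> list[float]:
    loss = model(x).pow(2).mean()  # dummy loss, just to drive backpropagation
    loss.backward()
    return [m.weight.grad.norm().item() for m in model if isinstance(m, nn.Linear)]

torch.manual_seed(0)
x = torch.randn(64, 32)
for name, act in [("Sigmoid", nn.Sigmoid()), ("ReLU", nn.ReLU())]:
    norms = layer_gradient_norms(build_mlp(depth=20, width=32, act=act), x)
    print(f"{name}: layer 1 grad = {norms[0]:.2e}, layer 20 grad = {norms[-1]:.2e}")
```
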
### Results

![Gradient Flow](exp1_gradient_flow.png)

#### Gradient Ratio (Layer 10 / Layer 1) at Depth=20

| Activation | Gradient Ratio | Interpretation |
|------------|----------------|----------------|
| Linear | 1.43e+00 | Stable gradient flow |
| Sigmoid | inf | Severe vanishing gradients |
| Tanh | 5.07e-01 | Stable gradient flow |
| ReLU | 1.08e+00 | Stable gradient flow |
| LeakyReLU | 1.73e+00 | Stable gradient flow |
| ELU | 8.78e-01 | Stable gradient flow |
| GELU | 3.34e-01 | Stable gradient flow |
| Swish | 1.14e+00 | Stable gradient flow |

### Theoretical Explanation

**Sigmoid** shows the most severe gradient decay because:
- Maximum derivative is only 0.25 (at z=0)
- In deep networks: 0.25²⁰ ≈ 10⁻¹² (effectively zero!)

**ReLU** maintains gradients better because:
- Derivative is exactly 1 for positive inputs
- But can be exactly 0 for negative inputs (dead neurons)

**GELU/Swish** provide smooth gradient flow:
- Derivatives are bounded but not as severely as Sigmoid
- Smooth transitions prevent sudden gradient changes

---

## Experiment 2: Sparsity and Dead Neurons

### Question
How do activations affect the sparsity of representations and the "death" of neurons?

### Method
- Trained 10-layer networks with high learning rate (0.1) to stress-test
- Measured activation sparsity (% of near-zero activations)
- Measured dead neuron rate (neurons that never activate)

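A sketch of how both statistics can be computed from a matrix of hidden activations (the near-zero threshold is an assumption; the original code may use a different one):

```python
import torch

def sparsity_and_dead(acts: torch.Tensor, eps: float = 1e-6) -> tuple[float, float]:
    """acts: (num_samples, num_neurons) activations from one hidden layer."""
    near_zero = acts.abs() < eps
    sparsity = near_zero.float().mean().item() * 100         # % of near-zero activations
    dead = near_zero.all(dim=0).float().mean().item() * 100  # % of neurons that never activate
    return sparsity, dead

torch.manual_seed(0)
pre_acts = torch.randn(1000, 256)          # toy zero-mean pre-activations
s, d = sparsity_and_dead(torch.relu(pre_acts))
print(f"ReLU: sparsity {s:.1f}%, dead neurons {d:.1f}%")  # roughly 50% sparse on random inputs
```
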
### Results

![Sparsity and Dead Neurons](exp2_sparsity_dead_neurons.png)

| Activation | Sparsity (%) | Dead Neurons (%) |
|------------|--------------|------------------|
| Linear | 0.0% | 100.0% |
| Sigmoid | 8.2% | 8.2% |
| Tanh | 0.0% | 0.0% |
| ReLU | 48.8% | 6.6% |
| LeakyReLU | 0.1% | 0.0% |
| ELU | 0.0% | 0.0% |
| GELU | 0.0% | 0.0% |
| Swish | 0.0% | 0.0% |

### Theoretical Explanation

**ReLU creates sparse representations**:
- Any negative input → output is exactly 0
- ~50% sparsity is typical with zero-mean inputs
- Sparsity can be beneficial (efficiency, regularization)

**Dead Neuron Problem**:
- If a ReLU neuron's input is always negative, it outputs 0 forever
- Gradient is 0, so weights never update
- Caused by: bad initialization, large learning rates, unlucky gradients

**Solutions**:
- **Leaky ReLU**: Small gradient (0.01) for negative inputs
- **ELU**: Smooth negative region with non-zero gradient
- **Proper initialization**: Keep activations in a good range
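
A tiny autograd check of the Leaky ReLU fix (a sketch; 0.01 is the default `negative_slope`):

```python
import torch
import torch.nn.functional as F

z = torch.tensor(-2.0, requires_grad=True)   # a negative pre-activation

F.relu(z).backward()
print(z.grad.item())   # 0.0 -> no learning signal: the unit is "dead" at this input

z.grad = None          # reset the accumulated gradient
F.leaky_relu(z, negative_slope=0.01).backward()
print(z.grad.item())   # 0.01 -> small but non-zero gradient keeps the unit trainable
```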

---

## Experiment 3: Training Stability

### Question
How stable is training under stress conditions (large learning rates, deep networks)?

### Method
- Tested learning rates: [0.001, 0.01, 0.1, 0.5, 1.0]
- Tested depths: [5, 10, 20, 50, 100]
- Measured whether training diverged (loss → ∞)

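Divergence can be detected mechanically; a sketch of the kind of check involved (the cutoff and the toy model are assumptions, not the original code):

```python
import math
import torch
import torch.nn as nn

def diverges(model: nn.Module, lr: float, steps: int = 200) -> bool:
    """Run a short training loop and report whether the loss blows up."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    x, y = torch.randn(256, 16), torch.randn(256, 1)
    for _ in range(steps):
        loss = nn.functional.mse_loss(model(x), y)
        if not math.isfinite(loss.item()) or loss.item() > 1e6:
            return True  # loss -> inf or NaN: training diverged
        opt.zero_grad()
        loss.backward()
        opt.step()
    return False

torch.manual_seed(0)
for lr in (0.001, 0.01, 0.1, 0.5, 1.0):
    net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
    print(f"lr={lr}: diverged={diverges(net, lr)}")
```
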
### Results

![Stability](exp3_stability.png)

### Key Observations

**Learning Rate Stability**:
- Sigmoid/Tanh: Most stable (bounded outputs prevent explosion)
- ReLU: Can diverge at high learning rates
- GELU/Swish: Good balance of stability and performance

**Depth Stability**:
- All activations struggle with depth > 50 without special techniques
- Sigmoid fails earliest due to vanishing gradients
- ReLU/LeakyReLU maintain trainability longer

### Theoretical Explanation

**Why bounded activations are more stable**:
- Sigmoid outputs ∈ (0, 1), so activations can't explode
- But gradients can vanish, making learning very slow

**Why ReLU can be unstable**:
- Unbounded outputs: large inputs → large outputs → larger gradients
- Positive feedback loop can cause explosion

**Modern solutions**:
- Batch Normalization: Keeps activations in good range
- Residual Connections: Allow gradients to bypass layers
- Gradient Clipping: Prevents explosion
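
Gradient clipping, for instance, is a one-line addition to a standard training loop (a generic PyTorch sketch, not the experiment code):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.5)
x, y = torch.randn(128, 16), torch.randn(128, 1)

for step in range(100):
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    # Rescale gradients so their global norm never exceeds 1.0,
    # cutting off the positive-feedback loop described above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
```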

---

## Experiment 4: Representational Capacity

### Question
How well can networks with different activations approximate various functions?

### Method
- Target functions: sin(x), |x|, step, sin(10x), x³
- 5-layer networks, 500 epochs training
- Measured test MSE

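A condensed sketch of one cell of this grid (a 5-layer ReLU network fitting sin(x) for 500 epochs, as in the method; the width, optimizer, learning rate, and input range are assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-2, 2, 512).unsqueeze(1)
y = torch.sin(x)                       # target function: sin(x)

# 5-layer MLP; hidden width 64 is an illustrative choice.
layers, in_dim = [], 1
for _ in range(5):
    layers += [nn.Linear(in_dim, 64), nn.ReLU()]
    in_dim = 64
layers.append(nn.Linear(64, 1))
model = nn.Sequential(*layers)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(500):
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final training MSE: {loss.item():.6f}")
```
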
### Results

![Representational Capacity](exp4_representational_heatmap.png)

![Predictions](exp4_predictions.png)

#### Test MSE by Activation × Target Function

| Activation | sin(x) | \|x\| | step | sin(10x) | x³ |
|------------|------|------|------|------|------|
| Linear | 0.0262 | 0.3347 | 0.0406 | 0.4906 | 1.4807 |
| Sigmoid | 0.0015 | 0.0025 | 0.0007 | 0.4910 | 0.0184 |
| Tanh | 0.0006 | 0.0022 | 0.0000 | 0.4903 | 0.0008 |
| ReLU | 0.0000 | 0.0000 | 0.0000 | 0.0006 | 0.0002 |
| LeakyReLU | 0.0000 | 0.0000 | 0.0000 | 0.0008 | 0.0004 |
| ELU | 0.0007 | 0.0005 | 0.0012 | 0.2388 | 0.0003 |
| GELU | 0.0000 | 0.0006 | 0.0001 | 0.0009 | 0.0033 |
| Swish | 0.0000 | 0.0017 | 0.0004 | 0.4601 | 0.0016 |

### Theoretical Explanation

**Universal Approximation Theorem**:
- Any continuous function can be approximated with enough neurons
- But different activations have different "inductive biases"

**ReLU excels at piecewise functions** (like |x|):
- ReLU networks compute piecewise linear functions
- Perfect match for |x| which is piecewise linear

**Smooth activations for smooth functions**:
- GELU, Swish produce smoother decision boundaries
- Better for smooth targets like sin(x)

**High-frequency functions are hard**:
- sin(10x) completes roughly six full oscillations over [-2, 2]
- Capturing every oscillation requires many neurons
- In this experiment only ReLU, LeakyReLU, and GELU fit it well; the other activations largely failed

---

## Experiment 5: Temporal Gradient Analysis

### Question
How do gradients evolve during training? Does the vanishing gradient problem persist or improve?

### Method
- Measured gradient magnitudes at epochs 1, 100, and 200
- Tracked gradient ratio (Layer 10 / Layer 1) over time
- Analyzed whether training helps or hurts gradient flow

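A sketch of the tracking logic (the ratio compares the last and first Linear layers of a 10-layer sigmoid MLP; the data and optimizer settings are illustrative):

```python
import torch
import torch.nn as nn

def first_last_grad_ratio(model: nn.Sequential) -> float:
    """Gradient-norm ratio: last Linear layer / first Linear layer."""
    grads = [m.weight.grad.norm().item() for m in model if isinstance(m, nn.Linear)]
    return grads[-1] / grads[0]

torch.manual_seed(0)
layers = []
for _ in range(10):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
model = nn.Sequential(*layers)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(256, 32), torch.randn(256, 32)

for epoch in range(1, 201):
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    if epoch in (1, 100, 200):   # the epochs reported in the table below
        print(f"epoch {epoch:3d}: ratio (L10/L1) = {first_last_grad_ratio(model):.2e}")
    opt.step()
```
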
### Results

![Gradient Flow Over Epochs](gradient_flow_epochs.png)

![Gradient Evolution](gradient_evolution.png)

#### Gradient Magnitudes at Key Training Epochs

| Activation | Epoch | Layer 1 | Layer 5 | Layer 10 | Ratio (L10/L1) |
|------------|-------|---------|---------|----------|----------------|
| Linear | 1 | 4.01e-04 | 3.29e-04 | 7.44e-04 | 1.86 |
| Linear | 100 | 3.10e-05 | 2.78e-05 | 3.57e-05 | 1.15 |
| Linear | 200 | 1.12e-07 | 9.99e-08 | 1.21e-07 | 1.08 |
| **Sigmoid** | **1** | **1.66e-10** | **2.40e-07** | **3.68e-03** | **2.22e+07** |
| **Sigmoid** | **100** | **1.04e-10** | **3.24e-10** | **4.77e-06** | **4.59e+04** |
| **Sigmoid** | **200** | **1.32e-10** | **1.24e-10** | **3.23e-08** | **2.45e+02** |
| ReLU | 1 | 1.20e-05 | 6.12e-06 | 3.23e-05 | 2.69 |
| ReLU | 100 | 2.04e-03 | 1.28e-03 | 4.84e-04 | 0.24 |
| ReLU | 200 | 1.27e-04 | 7.49e-05 | 1.91e-05 | 0.15 |
| Leaky ReLU | 1 | 2.78e-06 | 5.04e-06 | 3.17e-04 | 114 |
| Leaky ReLU | 100 | 1.30e-03 | 4.29e-04 | 3.37e-04 | 0.26 |
| Leaky ReLU | 200 | 8.98e-04 | 8.29e-04 | 1.79e-04 | 0.20 |
| GELU | 1 | 4.10e-07 | 7.02e-07 | 1.50e-04 | 365 |
| GELU | 100 | 2.66e-04 | 1.54e-04 | 2.57e-04 | 0.97 |
| GELU | 200 | 4.87e-04 | 2.21e-04 | 1.63e-04 | 0.34 |

### Key Insights

#### 1. Sigmoid's Catastrophic Vanishing Gradients
- **At epoch 1**: Gradient ratio is **22 million to 1** (Layer 10 vs Layer 1)
- This means Layer 1 receives 22 million times less gradient signal than Layer 10
- The early layers essentially cannot learn!
- Even after 200 epochs, the ratio is still 245:1

#### 2. Modern Activations Self-Correct
- **ReLU, Leaky ReLU, GELU**: Start with some gradient imbalance
- By epoch 100-200, ratios approach 0.2-1.0 (healthy range)
- The network learns to balance gradient flow through weight adaptation

#### 3. Training Dynamics Visualization

![Training Dynamics Summary](training_dynamics_summary.png)

This comprehensive figure shows:
- **Panel A**: Loss curves showing convergence speed
- **Panel B**: Gradient ratio evolution over training
- **Panel C**: Final learned functions
- **Panels D1-D3**: Gradient flow at epochs 1, 100, 200
- **Panels E1-E3**: Function approximation at epochs 50, 200, 499

### Theoretical Explanation

**Why Sigmoid gradients don't improve**:
- Sigmoid saturates to 0 or 1 for large inputs
- Derivative σ'(z) = σ(z)(1-σ(z)) → 0 when σ(z) → 0 or 1
- Deep layers push activations toward saturation
- Early layers are "locked" and cannot adapt

**Why ReLU/GELU gradients stabilize**:
- Adam optimizer adapts learning rates per-parameter
- Weights adjust to keep activations in "active" region
- Network finds a gradient-friendly configuration

### Practical Implications

1. **Sigmoid is fundamentally broken for deep hidden layers**
   - Not just slow to train: the early layers receive so little gradient that they effectively cannot learn
   - In this run, early-layer gradients were on the order of 10⁻¹⁰

2. **Modern activations are self-healing**
   - Initial gradient imbalance corrects during training
   - Adam optimizer helps by adapting per-parameter learning rates

3. **Monitor gradient ratios during training**
   - Ratio > 100 indicates vanishing gradients
   - Ratio < 0.01 indicates exploding gradients
   - Healthy range: 0.1 to 10
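
These rules of thumb are easy to wire into a training loop. A hypothetical helper that mirrors the cutoffs above:

```python
def gradient_health(ratio: float) -> str:
    """Classify a (last layer / first layer) gradient-norm ratio."""
    if ratio > 100:
        return "vanishing gradients (early layers starved of signal)"
    if ratio < 0.01:
        return "exploding gradients (early layers receive outsized updates)"
    return "healthy"

for r in (2.2e7, 245.0, 0.24, 1.1):
    print(f"ratio {r:>12}: {gradient_health(r)}")
```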

---

## Summary and Recommendations

### Comparison Table

| Property | Best Activations | Worst Activations |
|----------|------------------|-------------------|
| Gradient Flow | LeakyReLU, GELU | Sigmoid, Tanh |
| Avoids Dead Neurons | LeakyReLU, ELU, GELU | ReLU |
| Training Stability | Sigmoid, Tanh, GELU | ReLU (high lr) |
| Smooth Functions | GELU, Swish, Tanh | ReLU |
| Sharp Functions | ReLU, LeakyReLU | Sigmoid |
| Computational Speed | ReLU, LeakyReLU | GELU, Swish |

### Practical Recommendations

1. **Default Choice**: **ReLU** or **LeakyReLU**
   - Simple, fast, effective for most tasks
   - Use LeakyReLU if dead neurons are a concern

2. **For Transformers/Attention**: **GELU**
   - Standard in BERT, GPT, modern transformers
   - Smooth gradients help with optimization

3. **For Very Deep Networks**: **LeakyReLU** or **ELU**
   - Or use residual connections + batch normalization
   - Avoid Sigmoid/Tanh in hidden layers

4. **For Regression with Bounded Outputs**: **Sigmoid** (output layer only)
   - Use for probabilities or [0, 1] outputs
   - Never in hidden layers of deep networks

5. **For RNNs/LSTMs**: **Tanh** (traditional choice)
   - Zero-centered helps with recurrent dynamics
   - Modern alternative: use Transformers instead

### The Big Picture

```
                    ACTIVATION FUNCTION SELECTION GUIDE
                    
    ┌──────────────────────────────────────────────────────────────┐
    │                     Is it a hidden layer?                     │
    └──────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
             YES                             NO (output layer)
              │                               │
              ▼                               ▼
    ┌─────────────────┐             ┌────────────────────────┐
    │ Is it a         │             │ What's the task?       │
    │ Transformer?    │             │                        │
    └─────────────────┘             │ Binary class → Sigmoid │
              │                     │ Multi-class  → Softmax │
      ┌───────┴───────┐             │ Regression   → Linear  │
      ▼               ▼             └────────────────────────┘
    YES              NO
      │               │
      ▼               ▼
    GELU      ┌─────────────────┐
              │ Worried about   │
              │ dead neurons?   │
              └─────────────────┘
                      │
              ┌───────┴───────┐
              ▼               ▼
            YES              NO
              │               │
              ▼               ▼
         LeakyReLU          ReLU
           or ELU
```

---

## Files Generated

| File | Description |
|------|-------------|
| learned_functions.png | Final learned functions vs ground truth |
| loss_curves.png | Training loss curves over 500 epochs |
| gradient_flow.png | Gradient magnitude across layers (epoch 1) |
| gradient_flow_epochs.png | **NEW** Gradient flow at epochs 1, 100, 200 |
| gradient_evolution.png | **NEW** Gradient ratio evolution over training |
| hidden_activations.png | Activation distributions in trained network |
| training_dynamics_functions.png | **NEW** Function learning over time |
| activation_evolution.png | **NEW** Activation distribution evolution |
| training_dynamics_summary.png | **NEW** Comprehensive training dynamics |
| exp1_gradient_flow.png | Gradient magnitude across layers |
| exp2_sparsity_dead_neurons.png | Sparsity and dead neuron rates |
| exp2_activation_distributions.png | Activation value distributions |
| exp3_stability.png | Stability vs learning rate and depth |
| exp4_representational_heatmap.png | MSE heatmap for different targets |
| exp4_predictions.png | Actual predictions vs ground truth |
| summary_figure.png | Comprehensive summary visualization |

---

## References

1. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
2. He, K., et al. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.
3. Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs).
4. Ramachandran, P., et al. (2017). Searching for Activation Functions.
5. Nwankpa, C., et al. (2018). Activation Functions: Comparison of trends in Practice and Research for Deep Learning.

---

*Tutorial generated by Orchestra Research Assistant*
*All experiments are reproducible with the provided code*