Abstract
Standard losses interacting with positively biased activation functions cause negative weight drift during early training, leading to significant activation sparsity and affecting model accuracy across various architectures.
The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENet) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity reaching up to 90% in GPT-nano. We characterize the sparsity-accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff above ~70% activation sparsity. While ReLU^2 achieves a good sparsity-accuracy ratio in GPT-nano, it pathologically amplifies identified activation spikes in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU^2 outperforms its unclipped version, and GELU^2 achieves the lowest validation loss on GPT-nano. Code is available at https://github.com/On-Point-RND/BugOrFeature.
Community
Every time you train a network with ReLU, GELU, or SiLU, your weights quietly drift negative. This is not because of your data: it happens on random inputs too. It's baked into the math of gradient descent + asymmetric activations.
We prove it formally (MSE & cross-entropy) and show it across MLP, ResNet, ViT, GPT, and a speech model.
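To make the claim concrete, here is a rough sketch of why the expected gradient comes out non-negative in the simplest MSE case. This is not the paper's proof; the setup (a single hidden layer with a random linear readout `V`, and the symbols `a`, `W`, `z`, `h`, `t`, `m`, `\sigma`) is assumed purely for illustration.

```latex
% Assumed setup (illustrative, not taken from the paper):
%   a >= 0 : input to the layer (e.g. the output of a previous ReLU)
%   z = W a,  h = \phi(z),  y = V h,  L = \tfrac{1}{2}\lVert y - t \rVert^2
%   V \in \mathbb{R}^{m \times n} has i.i.d. zero-mean entries with variance \sigma^2,
%   drawn independently of a, W and t.
\begin{align}
\frac{\partial L}{\partial z}
  &= \phi'(z) \odot V^\top (V h - t) \\
\mathbb{E}_V\!\left[\frac{\partial L}{\partial z}\right]
  &= \phi'(z) \odot \left( \mathbb{E}_V[V^\top V]\, h - \mathbb{E}_V[V^\top]\, t \right)
   = m\,\sigma^2\, \phi'(z) \odot \phi(z) \\
\mathbb{E}_V\!\left[\frac{\partial L}{\partial W}\right]
  &= \mathbb{E}_V\!\left[\frac{\partial L}{\partial z}\right] a^\top
\end{align}
```

For any unit with z_i > 0, both \phi(z_i) and \phi'(z_i) are positive for ReLU, GELU, and SiLU, so the expected gradient for that pre-activation is positive. With non-negative inputs a, the expected gradient on the corresponding row of W is then non-negative as well, and a gradient descent step pushes those weights down.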
What does this drift do? Negative weights push pre-activations into negative regions, and with ReLU, up to 90% of activations end up zeroed out by the very same function that caused the drift in the first place! Bug or feature? Depends on how you use it.
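The effect described above can be reproduced in a toy setting. The sketch below is not the authors' code: it trains a small ReLU MLP with plain SGD on Gaussian inputs and pure-noise targets (an assumption made here to remove any data signal) and reports the mean of the middle-layer weights and the ReLU sparsity before and after. The function name `run_drift_demo` and all hyperparameters are hypothetical choices for illustration.

```python
import numpy as np

def run_drift_demo(steps=300, lr=0.005, batch=256,
                   d_in=32, hidden=64, d_out=8, seed=0):
    """Toy 3-layer ReLU MLP trained with SGD on random inputs and random targets."""
    rng = np.random.default_rng(seed)
    # Zero-mean Gaussian (He-style) initialization; row-vector convention x @ W.
    W1 = rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, hidden))
    W2 = rng.normal(0.0, np.sqrt(2.0 / hidden), size=(hidden, hidden))
    W3 = rng.normal(0.0, np.sqrt(2.0 / hidden), size=(hidden, d_out))

    def forward(x):
        z1 = x @ W1
        h1 = np.maximum(z1, 0.0)
        z2 = h1 @ W2
        h2 = np.maximum(z2, 0.0)
        return z1, h1, z2, h2, h2 @ W3

    w2_mean_before = float(W2.mean())
    # Fraction of exactly-zero activations in the second hidden layer.
    sparsity_before = float((forward(rng.normal(size=(batch, d_in)))[3] == 0).mean())

    for _ in range(steps):
        x = rng.normal(size=(batch, d_in))   # random inputs, no structure
        t = rng.normal(size=(batch, d_out))  # random targets (assumption of this sketch)
        z1, h1, z2, h2, y = forward(x)
        # Gradients of the MSE loss (1 / (2 * batch)) * sum ||y - t||^2.
        dy = (y - t) / batch
        gW3 = h2.T @ dy
        dz2 = (dy @ W3.T) * (z2 > 0)
        gW2 = h1.T @ dz2
        dz1 = (dz2 @ W2.T) * (z1 > 0)
        gW1 = x.T @ dz1
        W1 -= lr * gW1
        W2 -= lr * gW2
        W3 -= lr * gW3

    sparsity_after = float((forward(rng.normal(size=(batch, d_in)))[3] == 0).mean())
    return {
        "w2_mean_before": w2_mean_before,
        "w2_mean_after": float(W2.mean()),
        "sparsity_before": sparsity_before,
        "sparsity_after": sparsity_after,
    }

if __name__ == "__main__":
    for name, value in run_drift_demo().items():
        print(f"{name}: {value:.4f}")
```

The middle layer is the one to watch because its inputs are ReLU outputs and therefore non-negative. The first layer sees zero-mean inputs, so this sketch does not expect a comparable shift there.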
The most interesting finding: ReLU² boosts GPT-nano performance, but it pathologically amplifies activation spikes by 25×. The fix is simple: clip it. Clipped ReLU² and GELU² both outperform their non-squared versions, with GELU² achieving the best validation loss overall on GPT-nano.
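For readers who want to experiment, here is a minimal NumPy sketch of the squared activations mentioned above. The exact definitions used in the paper are not given on this page, so everything here is an assumption: the clip is applied to the squared output, the threshold `clip=6.0` is an arbitrary placeholder, and GELU² is taken to be the square of a tanh-approximated GELU. The repository linked below holds the authors' actual implementation.

```python
import numpy as np

def relu2(x):
    """Squared ReLU: max(x, 0) ** 2."""
    return np.maximum(x, 0.0) ** 2

def clipped_relu2(x, clip=6.0):
    """Squared ReLU with the output capped at `clip` (hypothetical threshold)."""
    return np.minimum(relu2(x), clip)

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def gelu2(x):
    """Squared GELU, assumed here to mean gelu(x) ** 2."""
    return gelu_tanh(x) ** 2

if __name__ == "__main__":
    x = np.array([-1.0, 2.0, 25.0])
    print(relu2(x))          # squaring turns the outlier 25 into 625
    print(clipped_relu2(x))  # the cap bounds the outlier at `clip`
```

Squaring multiplies any activation larger than 1 by its own magnitude, which is why outliers grow so quickly. A cap on the output bounds that growth while leaving small and moderate activations unchanged.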
💻 Code: github.com/On-Point-RND/BugOrFeature