arxiv:2602.15322

On Surprising Effectiveness of Masking Updates in Adaptive Optimizers

Published on Feb 17

· Submitted by

taesiri on Feb 18

Google

Upvote

Authors:

Abstract

Random parameter update masking achieves superior optimization for large language models by inducing curvature-dependent regularization, with a momentum-aligned variant delivering significant performance improvements over state-of-the-art adaptive optimizers.

AI-generated summary

Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19\% and 9\% compared to Adam and Muon, respectively.

View arXiv page View PDF Add to collection

Community

taesiri

Paper submitter about 5 hours ago

Randomly masking parameter updates in adaptive optimizers yields curvature-regularization; Magma, momentum-aligned masking, is a drop-in that improves LLM perplexity (about 19% vs Adam, 9% vs Muon) with minimal overhead.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.15322 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.15322 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.15322 in a Space README.md to link it from this page.