Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition
Abstract
This work develops a convergence theory for the Muon optimizer with Nesterov momentum and inexact polar decomposition in non-convex matrix optimization under heavy-tailed noise, establishing optimal complexity bounds and proposing efficient randomized low-rank methods.
Most first-order optimizers treat matrix-valued parameters as vectors, ignoring the intrinsic geometry of hidden-layer weights in neural networks. Muon addresses this mismatch by updating along the polar factor of a momentum matrix, but its theoretical understanding has lagged behind practice. In particular, practical implementations incorporate Nesterov momentum, compute the polar factor only approximately, and operate with stochastic gradients that may be heavy-tailed. We close this gap by developing a convergence theory for Muon with Nesterov momentum and inexact polar decomposition in non-convex matrix optimization under heavy-tailed noise. Our analysis builds on a unified framework for inexact polar decomposition that captures practical iterative approximations such as Newton-Schulz and quantifies how their errors propagate through the optimization dynamics. Under this framework, we establish an optimal iteration and sample complexity of $O\left(\varepsilon^{-(3\alpha-2)/(\alpha-1)}\right)$ for finding an $\varepsilon$-stationary point, where $\alpha \in (1,2]$ denotes the heavy-tail index. For the inexact-polar setting with $\sigma_1 = 0$, we also provide guarantees that do not require prior knowledge of $\alpha$. We analyze a randomized low-rank polar decomposition that is substantially more efficient than full-space methods while remaining compatible with our theory. Numerical experiments further demonstrate the effectiveness of the proposed inexact and randomized variants.
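To make the ingredients named in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It rests on these assumptions: the inexact polar factor comes from the classical cubic Newton-Schulz iteration (practical Muon code often uses a tuned higher-order polynomial instead), the Nesterov step follows one common convention (an exponential moving average plus a look-ahead blend), and the randomized variant is a generic Gaussian range-finder projection followed by a small polar computation. All function names, parameter names, and default values are hypothetical.

```python
import numpy as np


def newton_schulz_polar(M, steps=10, eps=1e-12):
    """Approximate the polar factor U V^T of M = U S V^T.

    Assumption: classical cubic Newton-Schulz, X <- 1.5 X - 0.5 X X^T X.
    """
    # Frobenius scaling bounds every singular value by 1, inside the
    # convergence region of the iteration.
    X = M / (np.linalg.norm(M) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        # Work on the wide orientation so X X^T is the smaller Gram matrix.
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transposed else X


def muon_nesterov_step(W, grad, momentum, lr=0.02, beta=0.95,
                       polar=newton_schulz_polar):
    """One Muon-style update with a Nesterov look-ahead (assumed convention)."""
    momentum = beta * momentum + (1.0 - beta) * grad    # moving average of gradients
    lookahead = beta * momentum + (1.0 - beta) * grad   # Nesterov-style blend
    return W - lr * polar(lookahead), momentum          # step along the polar factor


def randomized_lowrank_polar(M, rank, rng, steps=10):
    """Polar factor of a rank-`rank` projection of M (generic range-finder sketch)."""
    sketch = M @ rng.standard_normal((M.shape[1], rank))  # sample the column space
    Q, _ = np.linalg.qr(sketch)                           # orthonormal basis, (m, rank)
    B = Q.T @ M                                           # small (rank, n) matrix
    return Q @ newton_schulz_polar(B, steps=steps)
```

In this sketch, `steps` controls how inexact the polar factor is: fewer iterations leave the singular values further from 1. The low-rank variant runs the iteration only on a `rank`-by-n matrix, which is where a saving over a full-space method would come from. As a sanity check on the stated rate, setting $\alpha = 2$ in the exponent gives $(3 \cdot 2 - 2)/(2 - 1) = 4$, that is, $O(\varepsilon^{-4})$.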
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters (2026)
- MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration (2026)
- DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum (2026)
- Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization (2026)
- Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration (2026)
- Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer (2026)
- Muown: Row-Norm Control for Muon Optimization (2026)