MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
Abstract
Orthogonalized-update optimizers can be enhanced through pre-orthogonalization equilibration schemes that improve training geometry and convergence properties for matrix-valued parameters.
Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions typically either rescale updates after orthogonalization or use heavier whitening-based preconditioners before it. We introduce MuonEq, a lightweight family of pre-orthogonalization equilibration schemes for Muon with three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). By rebalancing the momentum matrix before finite-step Newton-Schulz orthogonalization, MuonEq improves the geometry seen by orthogonalization. We show that finite-step orthogonalization is governed by the input spectrum, especially stable rank and condition number, and that row/column normalization acts as a zeroth-order surrogate for whitening. For hidden matrix weights, R is the default variant. Theoretically, MuonEq (R) retains the standard $\mathcal{O}(T^{-1/4})$ Muon-type nonconvex stationarity guarantee with decoupled weight decay and a horizon-free diminishing learning-rate schedule, and extends it to finite-step NS5 up to an explicit inexactness constant. In LLaMA2 pretraining on C4, MuonEq (R) consistently outperforms Muon on 130M, 350M, and 1B models, with faster convergence and lower validation perplexity. The code is available in the MuonEq codebase at https://github.com/MaeChd/muon-eq.
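The abstract's core step, equilibrating the momentum matrix and then applying a finite number of Newton-Schulz iterations, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the `eps` guards, and the single row-then-column pass used for the RC variant are assumptions made for illustration, and the quintic coefficients are the ones used in the widely circulated Muon reference implementation, which the abstract itself does not specify.

```python
import numpy as np

# Quintic Newton-Schulz coefficients from the public Muon reference
# implementation (an assumption here; the abstract does not state them).
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def row_normalize(m, eps=1e-8):
    """Scale every row of m to (approximately) unit L2 norm."""
    return m / (np.linalg.norm(m, axis=1, keepdims=True) + eps)


def col_normalize(m, eps=1e-8):
    """Scale every column of m to (approximately) unit L2 norm."""
    return m / (np.linalg.norm(m, axis=0, keepdims=True) + eps)


def newton_schulz(m, steps=5, eps=1e-7):
    """Approximately orthogonalize m with a fixed number of NS iterations."""
    a, b, c = NS_COEFFS
    # Frobenius normalization keeps the spectral norm at or below 1.
    x = m / (np.linalg.norm(m) + eps)
    # Work on the wide orientation so the Gram matrix is the smaller one.
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * gram @ gram) @ x
    return x.T if transposed else x


def muoneq_direction(momentum, mode="R", steps=5):
    """Hypothetical helper: equilibrate the momentum, then orthogonalize.

    mode "R" normalizes rows, "C" normalizes columns, and "RC" applies one
    row pass followed by one column pass (an illustrative assumption about
    the two-sided variant).
    """
    if mode == "R":
        balanced = row_normalize(momentum)
    elif mode == "C":
        balanced = col_normalize(momentum)
    elif mode == "RC":
        balanced = col_normalize(row_normalize(momentum))
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return newton_schulz(balanced, steps=steps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Momentum with badly scaled rows, so its condition number is large.
    row_scales = np.array([[1e-3], [1.0], [10.0], [0.1], [5.0], [2.0]])
    momentum = rng.standard_normal((6, 10)) * row_scales
    print("cond before:", np.linalg.cond(momentum))
    print("cond after R:", np.linalg.cond(row_normalize(momentum)))
    update = muoneq_direction(momentum, mode="R")
    print("singular values of update:", np.linalg.svd(update, compute_uv=False))
```

Running the script prints the condition number of the momentum before and after row normalization, which is one way to inspect the "geometry seen by orthogonalization" that the abstract refers to. In an actual optimizer the returned direction would still be combined with the learning rate and decoupled weight decay, which this sketch leaves out.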