
MARS-M: When Variance Reduction Meets Matrices

Comments: 10 pages (main), 11 figures, 16 tables; bibliography 7 pages, appendix 9 pages
Abstract

Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). Recent benchmark studies of LLM pretraining optimizers have demonstrated that variance-reduction techniques such as MARS can substantially speed up training compared with standard optimizers that do not employ variance reduction. In this paper, we introduce MARS-M, a new optimizer that integrates MARS-style variance reduction with Muon. Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of $\tilde{\mathcal{O}}(T^{-1/3})$, improving upon the $\tilde{\mathcal{O}}(T^{-1/4})$ rate attained by Muon. Empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at this https URL.
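To make the combination concrete, the following is a minimal sketch of one optimizer step, not the authors' implementation. It assumes the standard Muon recipe (Newton-Schulz orthogonalization of a momentum matrix, with the quintic coefficients from the public Muon reference code) and a MARS-style variance-reduced gradient correction; the function names, hyperparameter defaults, and the exact form of the correction term are illustrative assumptions.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz
    iteration used by Muon (coefficients from the public Muon code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # scale so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def mars_m_step(W, grad, prev_grad, m, lr=0.02, beta=0.95, gamma=0.025):
    """One hypothetical MARS-M step (illustrative sketch):
    variance-reduced gradient correction, momentum accumulation,
    then Muon-style matrix orthogonalization of the update direction."""
    # MARS-style correction: c_t = g_t + gamma * beta/(1-beta) * (g_t - g_{t-1})
    c = grad + gamma * (beta / (1.0 - beta)) * (grad - prev_grad)
    m = beta * m + (1.0 - beta) * c      # momentum on the corrected gradient
    update = newton_schulz(m)            # matrix preconditioning (Muon)
    return W - lr * update, m
```

In this sketch, the variance-reduction term reuses the previous minibatch gradient to damp stochastic noise before the momentum buffer is updated, and the orthogonalization is applied per weight matrix, as in Muon.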
