ResearchTrend.AI

© 2026 ResearchTrend.AI, All rights reserved.

arXiv:2409.11321
SOAP: Improving and Stabilizing Shampoo using Adam

17 September 2024
Nikhil Vyas
Depen Morwani
Rosie Zhao
Itai Shapira
David Brandfonbrener
Lucas Janson
Sham Kakade
ArXiv (abs) · PDF · HTML · HuggingFace (1 upvote) · GitHub (259★)

Papers citing "SOAP: Improving and Stabilizing Shampoo using Adam"

50 of 93 citing papers shown (page 1 of 2)
A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent
Shuo Xie, Tianhao Wang, Beining Wu, Zhiyuan Li
25 Nov 2025

Solution of Incompressible Flow Equations with Physics and Equality Constrained Artificial Neural Networks
Qifeng Hu, Inanc Senocak
24 Nov 2025

DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning
Nikolay Yudin, Ekaterina Grishina, Andrey Veprikov, Alexandr Beznosikov, Maxim Rakhuba
09 Nov 2025

3D Gaussian Point Encoders
Jim James, Ben Wilson, Simon Lucey, James Hays
06 Nov 2025

Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal?
Weijie Su
01 Nov 2025
What Really Matters in Matrix-Whitening Optimizers?
Kevin Frans, Pieter Abbeel, Sergey Levine
28 Oct 2025

How do simple rotations affect the implicit bias of Adam?
Adela DePavia, Vasileios Charisopoulos, Rebecca Willett
27 Oct 2025

A Unified Perspective on Optimization in Machine Learning and Neuroscience: From Gradient Descent to Neural Adaptation
Jesus Garcia Fernandez, Nasir Ahmad, Marcel van Gerven
21 Oct 2025

SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients
Dominik Kallusky, Vinay Rao, Vishal Nandavanam, Hao-Jun Michael Shi
17 Oct 2025

Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise
Bingbin Liu, Rachit Bansal, Depen Morwani, Nikhil Vyas, David Alvarez-Melis, Sham Kakade
15 Oct 2025
Noise-Adaptive Layerwise Learning Rates: Accelerating Geometry-Aware Optimization for Deep Neural Network Training
Jie Hao, Xiaochuan Gong, Jie Xu, Z. Wang, Mingrui Liu
15 Oct 2025

Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods
Andrey Veprikov, Arman Bolatov, Samuel Horváth, Aleksandr Beznosikov, Martin Takáč, Slavomír Hanzely
12 Oct 2025

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton
Natalie Abreu, Nikhil Vyas, Sham Kakade, Depen Morwani
10 Oct 2025

Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization
Kristi Topollai, A. Choromańska
06 Oct 2025

QDeepGR4J: Quantile-based ensemble of deep learning and GR4J hybrid rainfall-runoff models for extreme flow prediction with uncertainty quantification
Arpit Kapoor, Rohitash Chandra
06 Oct 2025
Conda: Column-Normalized Adam for Training Large Language Models Faster
Junjie Wang, Pan Zhou, Yiming Dong, Huan Li, Jia Li, Xun Zhou, Qicheng Lao, Cong Fang, Zhouchen Lin
29 Sep 2025

Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs
Shane Bergsma, Nolan Dey, Joel Hestness
29 Sep 2025

Scaling with Collapse: Efficient and Predictable Training of LLM Families
Shane Bergsma, Bin Claire Zhang, Nolan Dey, Shaheer Muhammad, Gurpreet Gosal, Joel Hestness
29 Sep 2025

Effective Quantization of Muon Optimizer States
Aman Gupta, Rafael Celente, Abhishek Shivanna, D. T. Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, R. Ramanath, S. Keerthi
27 Sep 2025

Understanding SOAP from the Perspective of Gradient Whitening
Yanqing Lu, Letao Wang, Jinbo Liu
26 Sep 2025
Incentives in Federated Learning with Heterogeneous Agents
Ariel D. Procaccia, Han Shao, Itai Shapira
25 Sep 2025

AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates
Minxin Zhang, Yuxuan Liu, Hayden Schaeffer
03 Sep 2025

Simple Stepsize for Quasi-Newton Methods with Global Convergence Guarantees
A. Agafonov, Vladislav Ryspayev, Samuel Horváth, Alexander V. Gasnikov, Martin Takáč, Slavomír Hanzely
27 Aug 2025

ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection
Axel Delaval, Shujian Yang, Huaimin Wang, Han Qiu, Jialiang Lu
15 Aug 2025

EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes
Adam Block, Cyril Zhang
31 Jul 2025
Simulating Three-dimensional Turbulence with Physics-informed Neural Networks
Sifan Wang, Shyam Sankaran, Xiantao Fan, P. Stinis, P. Perdikaris
11 Jul 2025

Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful
Martin Marek, Sanae Lotfi, Aditya Somasundaram, A. Wilson, Micah Goldblum
09 Jul 2025

GradMetaNet: An Equivariant Architecture for Learning on Gradients
Yoav Gelberg, Yam Eitan, Aviv Navon, Aviv Shamsian, Theo Putterman, Michael M. Bronstein, Haggai Maron
02 Jul 2025

A Stable Whitening Optimizer for Efficient Neural Network Training
Kevin Frans, Sergey Levine, Pieter Abbeel
08 Jun 2025

Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner
Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard Turner, Hao-Jun Michael Shi
04 Jun 2025
Lions and Muons: Optimization via Stochastic Frank-Wolfe
Maria-Eleni Sfyraki, Jun-Kun Wang
04 Jun 2025

Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order
Egor Petrov, Grigoriy Evseev, Aleksey Antonov, Andrey Veprikov, Nikolay Bushkov, Stanislav Moiseev
04 Jun 2025

Taming LLMs by Scaling Learning Rates with Gradient Grouping
Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu
01 Jun 2025

SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training
Yehonathan Refael, Guy Smorodinsky, Tom Tirer, Ofir Lindenbaum
30 May 2025

GradPower: Powering Gradients for Faster Language Model Pre-Training
Mingze Wang, Jinbo Wang, Jiaqi Zhang, Wei Wang, Peng Pei, Xunliang Cai, Weinan E, Lei Wu
30 May 2025
On the Convergence Analysis of Muon
Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, Jiawei Zhang
29 May 2025

Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
19 May 2025

Pairwise Calibrated Rewards for Pluralistic Alignment
Daniel Halpern, Evi Micha, Ariel D. Procaccia, Itai Shapira
17 May 2025

Towards Quantifying the Hessian Structure of Neural Networks
Zhaorui Dong, Yushun Zhang, Jianfeng Yao
05 May 2025

ASGO: Adaptive Structured Gradient Optimization
Kang An, Yuxing Liu, Boyao Wang, Shiqian Ma, Tong Zhang
26 Mar 2025
Striving for Simplicity: Simple Yet Effective Prior-Aware Pseudo-Labeling for Semi-Supervised Ultrasound Image Segmentation (MICCAI 2025)
Yaxiong Chen, Yujie Wang, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
18 Mar 2025

Structured Preconditioners in Adaptive Optimization: A Unified Analysis
Shuo Xie, Tianhao Wang, Sashank J. Reddi, Sanjiv Kumar, Zhiyuan Li
13 Mar 2025

CLIMB: Data Foundations for Large Scale Multimodal Clinical Foundation Models
Wei Dai, Peilin Chen, Malinda Lu, Daniel Li, Haowen Wei, Hejie Cui, Paul Pu Liang
09 Mar 2025

LapLoss: Laplacian Pyramid-based Multiscale loss for Image Translation
Krish Didwania, Ishaan Gakhar, Prakhar Arya, Sanskriti Labroo
07 Mar 2025

DEAL-YOLO: Drone-based Efficient Animal Localization using YOLO
Aditya Prashant Naidu, Hem Gosalia, Ishaan Gakhar, Shaurya Singh Rathore, Krish Didwania, Ujjwal Verma
06 Mar 2025
Deep Learning is Not So Mysterious or Different
Andrew Gordon Wilson
03 Mar 2025

NeoBERT: A Next-Generation BERT
Lola Le Breton, Quentin Fournier, Mariam El Mezouar, John X. Morris, Sarath Chandar
26 Feb 2025

The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training
Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, Weinan E, Lei Wu
26 Feb 2025

COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs
Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, T. Zhao
24 Feb 2025

Spectral-factorized Positive-definite Curvature Learning for NN Training
Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Roger B. Grosse
10 Feb 2025