ResearchTrend.AI

© 2026 ResearchTrend.AI, All rights reserved.

arXiv:2409.11321
SOAP: Improving and Stabilizing Shampoo using Adam

17 September 2024
Nikhil Vyas
Depen Morwani
Rosie Zhao
Itai Shapira
David Brandfonbrener
Lucas Janson
Sham Kakade
ArXiv (abs) · PDF · HTML · HuggingFace (1 upvote) · GitHub (259★)

Papers citing "SOAP: Improving and Stabilizing Shampoo using Adam"

50 of 93 citing papers shown (page 1 of 2)
A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent
Shuo Xie, Tianhao Wang, Beining Wu, Zhiyuan Li
25 Nov 2025

Solution of Incompressible Flow Equations with Physics and Equality Constrained Artificial Neural Networks
Qifeng Hu, Inanc Senocak
24 Nov 2025

DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning
Nikolay Yudin, Ekaterina Grishina, Andrey Veprikov, Alexandr Beznosikov, Maxim Rakhuba
09 Nov 2025

3D Gaussian Point Encoders
Jim James, Ben Wilson, Simon Lucey, James Hays
06 Nov 2025

Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal?
Weijie Su
01 Nov 2025
What Really Matters in Matrix-Whitening Optimizers?
Kevin Frans, Pieter Abbeel, Sergey Levine
28 Oct 2025

How do simple rotations affect the implicit bias of Adam?
Adela DePavia, Vasileios Charisopoulos, Rebecca Willett
27 Oct 2025

A Unified Perspective on Optimization in Machine Learning and Neuroscience: From Gradient Descent to Neural Adaptation
Jesus Garcia Fernandez, Nasir Ahmad, Marcel van Gerven
21 Oct 2025

SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients
Dominik Kallusky, Vinay Rao, Vishal Nandavanam, Hao-Jun Michael Shi
17 Oct 2025

Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise
Bingbin Liu, Rachit Bansal, Depen Morwani, Nikhil Vyas, David Alvarez-Melis, Sham Kakade
15 Oct 2025
Noise-Adaptive Layerwise Learning Rates: Accelerating Geometry-Aware Optimization for Deep Neural Network Training
Jie Hao, Xiaochuan Gong, Jie Xu, Z. Wang, Mingrui Liu
15 Oct 2025

Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods
Andrey Veprikov, Arman Bolatov, Samuel Horváth, Aleksandr Beznosikov, Martin Takáč, Slavomír Hanzely
12 Oct 2025

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton
Natalie Abreu, Nikhil Vyas, Sham Kakade, Depen Morwani
10 Oct 2025

Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization
Kristi Topollai, A. Choromańska
06 Oct 2025

QDeepGR4J: Quantile-based ensemble of deep learning and GR4J hybrid rainfall-runoff models for extreme flow prediction with uncertainty quantification
Arpit Kapoor, Rohitash Chandra
06 Oct 2025
Conda: Column-Normalized Adam for Training Large Language Models Faster
Junjie Wang, Pan Zhou, Yiming Dong, Huan Li, Jia Li, Xun Zhou, Qicheng Lao, Cong Fang, Zhouchen Lin
29 Sep 2025

Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs
Shane Bergsma, Nolan Dey, Joel Hestness
29 Sep 2025

Scaling with Collapse: Efficient and Predictable Training of LLM Families
Shane Bergsma, Bin Claire Zhang, Nolan Dey, Shaheer Muhammad, Gurpreet Gosal, Joel Hestness
29 Sep 2025

Effective Quantization of Muon Optimizer States
Aman Gupta, Rafael Celente, Abhishek Shivanna, D. T. Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, R. Ramanath, S. Keerthi
27 Sep 2025

Understanding SOAP from the Perspective of Gradient Whitening
Yanqing Lu, Letao Wang, Jinbo Liu
26 Sep 2025
Incentives in Federated Learning with Heterogeneous Agents
Ariel D. Procaccia, Han Shao, Itai Shapira
25 Sep 2025

AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates
Minxin Zhang, Yuxuan Liu, Hayden Schaeffer
03 Sep 2025

Simple Stepsize for Quasi-Newton Methods with Global Convergence Guarantees
A. Agafonov, Vladislav Ryspayev, Samuel Horváth, Alexander V. Gasnikov, Martin Takáč, Slavomír Hanzely
27 Aug 2025

ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection
Axel Delaval, Shujian Yang, Huaimin Wang, Han Qiu, Jialiang Lu
15 Aug 2025

EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes
Adam Block, Cyril Zhang
31 Jul 2025
Simulating Three-dimensional Turbulence with Physics-informed Neural Networks
Sifan Wang, Shyam Sankaran, Xiantao Fan, P. Stinis, P. Perdikaris
11 Jul 2025

Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful
Martin Marek, Sanae Lotfi, Aditya Somasundaram, A. Wilson, Micah Goldblum
09 Jul 2025

GradMetaNet: An Equivariant Architecture for Learning on Gradients
Yoav Gelberg, Yam Eitan, Aviv Navon, Aviv Shamsian, Theo Putterman, Michael M. Bronstein, Haggai Maron
02 Jul 2025

A Stable Whitening Optimizer for Efficient Neural Network Training
Kevin Frans, Sergey Levine, Pieter Abbeel
08 Jun 2025

Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner
Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard Turner, Hao-Jun Michael Shi
04 Jun 2025
Lions and Muons: Optimization via Stochastic Frank-Wolfe
Maria-Eleni Sfyraki, Jun-Kun Wang
04 Jun 2025

Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order
Egor Petrov, Grigoriy Evseev, Aleksey Antonov, Andrey Veprikov, Nikolay Bushkov, Stanislav Moiseev
04 Jun 2025

Taming LLMs by Scaling Learning Rates with Gradient Grouping
Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu
01 Jun 2025

SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training
Yehonathan Refael, Guy Smorodinsky, Tom Tirer, Ofir Lindenbaum
30 May 2025

GradPower: Powering Gradients for Faster Language Model Pre-Training
Mingze Wang, Jinbo Wang, Jiaqi Zhang, Wei Wang, Peng Pei, Xunliang Cai, Weinan E, Lei Wu
30 May 2025
On the Convergence Analysis of Muon
Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, Jiawei Zhang
29 May 2025

Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
19 May 2025

Pairwise Calibrated Rewards for Pluralistic Alignment
Daniel Halpern, Evi Micha, Ariel D. Procaccia, Itai Shapira
17 May 2025

Towards Quantifying the Hessian Structure of Neural Networks
Zhaorui Dong, Yushun Zhang, Jianfeng Yao
05 May 2025

ASGO: Adaptive Structured Gradient Optimization
Kang An, Yuxing Liu, Boyao Wang, Shiqian Ma, Tong Zhang
26 Mar 2025
Striving for Simplicity: Simple Yet Effective Prior-Aware Pseudo-Labeling for Semi-Supervised Ultrasound Image Segmentation (MICCAI 2025)
Yaxiong Chen, Yujie Wang, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
18 Mar 2025

Structured Preconditioners in Adaptive Optimization: A Unified Analysis
Shuo Xie, Tianhao Wang, Sashank J. Reddi, Sanjiv Kumar, Zhiyuan Li
13 Mar 2025

CLIMB: Data Foundations for Large Scale Multimodal Clinical Foundation Models
Wei Dai, Peilin Chen, Malinda Lu, Daniel Li, Haowen Wei, Hejie Cui, Paul Pu Liang
09 Mar 2025

LapLoss: Laplacian Pyramid-based Multiscale loss for Image Translation
Krish Didwania, Ishaan Gakhar, Prakhar Arya, Sanskriti Labroo
07 Mar 2025

DEAL-YOLO: Drone-based Efficient Animal Localization using YOLO
Aditya Prashant Naidu, Hem Gosalia, Ishaan Gakhar, Shaurya Singh Rathore, Krish Didwania, Ujjwal Verma
06 Mar 2025
Deep Learning is Not So Mysterious or Different
Andrew Gordon Wilson
03 Mar 2025

NeoBERT: A Next-Generation BERT
Lola Le Breton, Quentin Fournier, Mariam El Mezouar, John X. Morris, Sarath Chandar
26 Feb 2025

The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training
Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, Weinan E, Lei Wu
26 Feb 2025

COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs
Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, T. Zhao
24 Feb 2025

Spectral-factorized Positive-definite Curvature Learning for NN Training
Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Roger B. Grosse
10 Feb 2025