A Spectral Condition for Feature Learning

26 October 2023

Greg Yang

James B. Simon

Jeremy Bernstein

Papers citing "A Spectral Condition for Feature Learning"

41 / 41 papers shown

Controlling changes to attention logits

Ben Anson

Laurence Aitchison

226

26 Nov 2025

Deep Progressive Training: scaling up depth capacity of zero/one-layer models

Zhiqi Bu

AI4CE

173

07 Nov 2025

Weight Decay may matter more than muP for Learning Rate Transfer in Practice

236

21 Oct 2025

Robust Layerwise Scaling Rules by Proper Weight Decay Tuning

149

17 Oct 2025

AdaPM: a Partial Momentum Algorithm for LLM Training

Yimu Zhang

Yuanshi Liu

Cong Fang

225

10 Oct 2025

POME: Post Optimization Model Edit via Muon-style Projection

131

08 Oct 2025

Spectral Alignment as Predictor of Loss Explosion in Neural Network Training

142

05 Oct 2025

Optimal Scaling Needs Optimal Norm

241

04 Oct 2025

Muon Outperforms Adam in Tail-End Associative Memory Learning

219

30 Sep 2025

Conda: Column-Normalized Adam for Training Large Language Models Faster

294

29 Sep 2025

Beyond Outliers: A Study of Optimizers Under Quantization

312

27 Sep 2025

Understanding Post-Training Structural Changes in Large Language Models

Xinyu He

Xianghui Cao

253

22 Sep 2025

Customizing the Inductive Biases of Softmax Attention using Structured Matrices

177

09 Sep 2025

μ-Parametrization for Mixture of Experts

...

269

13 Aug 2025

Knowing When to Quit: Probabilistic Early Exits for Speech Separation

Rasmus Malik Høegh Lindrup

Bjørn Sand Jensen

Morten Mørup

UQCV

370

13 Jul 2025

A Stable Whitening Optimizer for Efficient Neural Network Training

Kevin Frans

Sergey Levine

Pieter Abbeel

516

08 Jun 2025

Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism

Sameera Ramasinghe

Thalaiyasingam Ajanthan

471

02 Jun 2025

SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training

249

30 May 2025

The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

331

22 May 2025

ASGO: Adaptive Structured Gradient Optimization

558

26 Mar 2025

Global Convergence and Rich Feature Learning in L-Layer Infinite-Width Neural Networks under μ

307

12 Mar 2025

LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM

476

26 Feb 2025

Function-Space Learning Rates

Edward Milsom

Ben Anson

Laurence Aitchison

536

24 Feb 2025

Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit

427

10 Jan 2025

AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

International Conference on Learning Representations (ICLR), 2024

365

31 Dec 2024

Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training

Neural Information Processing Systems (NeurIPS), 2024

338

31 Oct 2024

Modular Duality in Deep Learning

Jeremy Bernstein

Laker Newhouse

212

28 Oct 2024

Plastic Learning with Deep Fourier Features

International Conference on Learning Representations (ICLR), 2024

327

27 Oct 2024

The Optimization Landscape of SGD Across the Feature Learning Strength

International Conference on Learning Representations (ICLR), 2024

Alexander B. Atanasov

Alexandru Meterez

James B. Simon

Cengiz Pehlevan

496

06 Oct 2024

Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices

Neural Information Processing Systems (NeurIPS), 2024

Andrew Gordon Wilson

297

03 Oct 2024

Old Optimizer, New Norm: An Anthology

Jeremy Bernstein

Laker Newhouse

ODL

394

30 Sep 2024

u-μP: The Unit-Scaled Maximal Update Parametrization

Andres Felipe Cruz Salinas

Carlo Luschi

Samuel Weinbach

Douglas Orr

410

24 Jul 2024

Compute Better Spent: Replacing Dense Layers with Structured Matrices

Andrew Gordon Wilson

311

10 Jun 2024

Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning

Neural Information Processing Systems (NeurIPS), 2024

Feng Chen

388

10 Jun 2024

Recurrent neural networks: vanishing and exploding gradients are not the end of the story

Nicolas Zucchet

Antonio Orvieto

ODL AAML

362

31 May 2024

Infinite Limits of Multi-head Transformer Dynamics

431

24 May 2024

Sparse maximal update parameterization: A holistic approach to sparse training dynamics

Nolan Dey

Shane Bergsma

Joel Hestness

391

24 May 2024

Scalable Optimization in the Modular Norm

Neural Information Processing Systems (NeurIPS), 2024

Yang Liu

320

23 May 2024

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Yuandong Tian

564

416

06 Mar 2024

Spike No More: Stabilizing the Pre-training of Large Language Models

535

28 Dec 2023

The Feature Speed Formula: a flexible approach to scale hyper-parameters of deep neural networks

Neural Information Processing Systems (NeurIPS), 2023

Lénaic Chizat

Praneeth Netrapalli

527

30 Nov 2023