Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

1 April 2019

Yang You

Jing Li

Sashank J. Reddi

Jonathan Hseu

Sanjiv Kumar

Srinadh Bhojanapalli

Papers citing "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes"

50 / 647 papers shown

Controlling changes to attention logits

Ben Anson

Laurence Aitchison

228

26 Nov 2025

Advancing Image Classification with Discrete Diffusion Classification Modeling

285

25 Nov 2025

A Circular Argument: Does RoPE need to be Equivariant for Vision?

235

11 Nov 2025

Non-Negative Stiefel Approximating Flow: Orthogonalish Matrix Optimization for Interpretable Embeddings

Brian B. Avants

Nicholas J. Tustison

J. Stone

133

09 Nov 2025

Spin-Adapted Neural Network Wavefunctions in Real Space

132

03 Nov 2025

AI Progress Should Be Measured by Capability-Per-Resource, Not Scale Alone: A Framework for Gradient-Guided Resource Allocation in LLMs

David McCoy

Yulun Wu

Zachary Butzin-Dozier

153

02 Nov 2025

Exploring Landscapes for Better Minima along Valleys

140

31 Oct 2025

Relative Scaling Laws for LLMs

William B. Held

David Leo Wright Hall

Abigail Z. Jacobs

Diyi Yang

236

28 Oct 2025

On Optimal Hyperparameters for Differentially Private Deep Transfer Learning

169

23 Oct 2025

A Scalable, Causal, and Energy Efficient Framework for Neural Decoding with Spiking Neural Networks

Georgios Mentzelopoulos

198

23 Oct 2025

HyperDiffusionFields (HyDiF): Diffusion-Guided Hypernetworks for Learning Implicit Molecular Neural Fields

220

20 Oct 2025

Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling

Alexandru Meterez

Depen Morwani

Jingfeng Wu

Costin-Andrei Oncescu

Cengiz Pehlevan

Sham Kakade

LRM

198

16 Oct 2025

Noise-Adaptive Layerwise Learning Rates: Accelerating Geometry-Aware Optimization for Deep Neural Network Training

207

15 Oct 2025

DYNAMIX: RL-based Adaptive Batch Size Optimization in Distributed Machine Learning Systems

Yuanjun Dai

Keqiang He

An Wang

162

09 Oct 2025

Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization

Kristi Topollai

A. Choromańska

ODL

419

06 Oct 2025

Integrating Offline Pre-Training with Online Fine-Tuning: A Reinforcement Learning Approach for Robot Social Navigation

274

01 Oct 2025

Conda: Column-Normalized Adam for Training Large Language Models Faster

297

29 Sep 2025

Data-Efficient Training by Evolved Sampling

Ziheng Cheng

Zhong Li

Jiang Bian

190

27 Sep 2025

Development of Deep Learning Optimizers: Approaches, Concepts, and Update Rules

Doğay Altınel

202

22 Sep 2025

Profiling LoRA/QLoRA Fine-Tuning Efficiency on Consumer GPUs: An RTX 4060 Case Study

MSR Avinash

100

07 Sep 2025

On Using Large-Batches in Federated Learning

Sahil Tyagi

FedML

149

05 Sep 2025

MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training

198

28 Aug 2025

When Routers, Switches and Interconnects Compute: A processing-in-interconnect Paradigm for Scalable Neuromorphic AI

Madhuvanthi Srivatsav R

Chiranjib Bhattacharyya

S. Chakrabartty

Chetan Singh Thakur

171

27 Aug 2025

Tri-Accel: Curvature-Aware Precision-Adaptive and Memory-Elastic Optimization for Efficient GPU Usage

265

23 Aug 2025

Towards Reliable and Generalizable Differentially Private Machine Learning (Extended Version)

Wenxuan Bao

Vincent Bindschaedler

AAML

318

21 Aug 2025

MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data

319

14 Aug 2025

Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation

...

499

07 Aug 2025

Slice or the Whole Pie? Utility Control for AI Models

Ye Tao

AAML

136

06 Aug 2025

Fishers for Free? Approximating the Fisher Information Matrix by Recycling the Squared Gradient Accumulator

326

24 Jul 2025

Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful

499

09 Jul 2025

Both Asymptotic and Non-Asymptotic Convergence of Quasi-Hyperbolic Momentum using Increasing Batch Size

Kento Imaizumi

Hideaki Iiduka

277

30 Jun 2025

An Adaptive Method Stabilizing Activations for Enhanced Generalization

358

10 Jun 2025

Investigating Mask-aware Prototype Learning for Tabular Anomaly Detection

241

03 Jun 2025

Taming LLMs by Scaling Learning Rates with Gradient Grouping

284

01 Jun 2025

SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training

249

30 May 2025

On the Convergence Analysis of Muon

455

29 May 2025

DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models

...

508

28 May 2025

Deep Learning-Based Forecasting of Boarding Patient Counts to Address ED Overcrowding

236

20 May 2025

A Physics-Inspired Optimizer: Velocity Regularized Adam

547

19 May 2025

Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training

506

19 May 2025

On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm

564

17 May 2025

Pretraining Large Brain Language Model for Active BCI: Silent Speech

...

562

29 Apr 2025

AlphaGrad: Non-Linear Gradient Normalization Optimizer

Soham Sane

ODL

442

22 Apr 2025

Learning Energy-Based Generative Models via Potential Flow: A Variational Principle Approach to Probability Density Homotopy Matching

361

22 Apr 2025

Perception Encoder: The best visual embeddings are not at the output of the network

Daniel Bolya

Po-Yao (Bernie) Huang

...

Christoph Feichtenhofer

ObjD VOS

830

197

17 Apr 2025

Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training

423

12 Apr 2025

Low-Bit Integerization of Vision Transformers using Operand Reordering for Efficient Hardware

Ching-Yi Lin

Sahil Shah

342

11 Apr 2025

Neural Encoding and Decoding at Scale

International Brain Laboratory

635

11 Apr 2025

The Efficacy of Semantics-Preserving Transformations in Self-Supervised Learning for Medical Ultrasound

297

10 Apr 2025

MultiNeRF: Multiple Watermark Embedding for Neural Radiance Fields

Yash Kulthe

Andrew Gilbert

John Collomosse

383

03 Apr 2025