On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent

30 November 2018
Noah Golmant, N. Vemuri, Z. Yao, Vladimir Feinberg, A. Gholami, Kai Rothauge, Michael W. Mahoney, Joseph E. Gonzalez

Papers citing "On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent"

24 papers shown

Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
19 May 2025

How Does Critical Batch Size Scale in Pre-training?
Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Phillips Foster, Sham Kakade
29 Oct 2024

Parallel Split Learning with Global Sampling
Mohammad Kohankhaki, Ahmad Ayad, Mahdi Barhoush, A. Schmeink
22 Jul 2024

On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen, Yan Sun, Zhiyuan Yu, Liang Ding, Xinmei Tian, Dacheng Tao
VLM
07 Apr 2023

Learning Deep Optimal Embeddings with Sinkhorn Divergences
S. Roy, Yan Han, Mehrtash Harandi, L. Petersson
14 Sep 2022

ILASR: Privacy-Preserving Incremental Learning for Automatic Speech Recognition at Production Scale
Gopinath Chennupati, Milind Rao, Gurpreet Chadha, Aaron Eakin, A. Raju, ..., Andrew Oberlin, Buddha Nandanoor, Prahalad Venkataramanan, Zheng Wu, Pankaj Sitpure
CLL
19 Jul 2022

Non-Asymptotic Analysis of Online Multiplicative Stochastic Gradient Descent
Riddhiman Bhattacharya, Tiefeng Jiang
14 Dec 2021

Batch size-invariance for policy optimization
Jacob Hilton, K. Cobbe, John Schulman
01 Oct 2021

Stochastic Training is Not Necessary for Generalization
Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein
29 Sep 2021

Shift-Curvature, SGD, and Generalization
Arwen V. Bradley, C. Gomez-Uribe, Manish Reddy Vuyyuru
21 Aug 2021

On Large-Cohort Training for Federated Learning
Zachary B. Charles, Zachary Garrett, Zhouyuan Huo, Sergei Shmulyian, Virginia Smith
FedML
15 Jun 2021

Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models
J. Lamy-Poirier
MoE
04 Jun 2021

Improved generalization by noise enhancement
Takashi Mori, Masahito Ueda
28 Sep 2020

AdaScale SGD: A User-Friendly Algorithm for Distributed Training
Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin
ODL
09 Jul 2020

Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training
Diego Granziol, S. Zohren, Stephen J. Roberts
ODL
16 Jun 2020

The Limit of the Batch Size
Yang You, Yuhui Wang, Huan Zhang, Zhao-jie Zhang, J. Demmel, Cho-Jui Hsieh
15 Jun 2020

Stochastic Weight Averaging in Parallel: Large-Batch Training that Generalizes Well
Vipul Gupta, S. Serrano, D. DeCoste
MoMe
07 Jan 2020

Distributed Learning of Deep Neural Networks using Independent Subnet Training
John Shelton Hyatt, Cameron R. Wolfe, Michael Lee, Yuxin Tang, Anastasios Kyrillidis, Christopher M. Jermaine
OOD
04 Oct 2019

Augment your batch: better training with larger batches
Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, Daniel Soudry
ODL
27 Jan 2019

Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
Charles H. Martin, Michael W. Mahoney
AI4CE
02 Oct 2018

Large batch size training of neural networks with adversarial training and second-order information
Z. Yao, A. Gholami, Daiyaan Arfeen, Richard Liaw, Joseph E. Gonzalez, Kurt Keutzer, Michael W. Mahoney
ODL
02 Oct 2018

Don't Use Large Mini-Batches, Use Local SGD
Tao R. Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi
22 Aug 2018

Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior
Charles H. Martin, Michael W. Mahoney
AI4CE
26 Oct 2017

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang
ODL
15 Sep 2016