Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods

arXiv:2406.13936 · 20 June 2024

Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar

Papers citing "Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods"

9 / 9 papers shown
1. Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
   Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar
   30 Dec 2024

2. Nemotron-4 15B Technical Report
   Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, M. Patwary, Sandeep Subramanian, …, Ashwath Aithal, Oleksii Kuchaiev, M. Shoeybi, Jonathan Cohen, Bryan Catanzaro
   26 Feb 2024

3. AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods (ODL)
   Tim Tsz-Kit Lau, Han Liu, Mladen Kolar
   17 Feb 2024

4. OLMo: Accelerating the Science of Language Models (OSLM)
   Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Michael Kinney, …, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hanna Hajishirzi
   01 Feb 2024

5. DiLoCo: Distributed Low-Communication Training of Language Models
   Arthur Douillard, Qixuang Feng, Andrei A. Rusu, Rachita Chhaparia, Yani Donchev, A. Kuncoro, Marc'Aurelio Ranzato, Arthur Szlam, Jiajun Shen
   14 Nov 2023

6. Adaptive Federated Learning with Auto-Tuned Clients (FedML)
   J. Kim, Taha Toghani, César A. Uribe, Anastasios Kyrillidis
   19 Jun 2023

7. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (ODL)
   N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang
   15 Sep 2016

8. A Proximal Stochastic Gradient Method with Progressive Variance Reduction (ODL)
   Lin Xiao, Tong Zhang
   19 Mar 2014

9. Optimal Distributed Online Prediction using Mini-Batches
   O. Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
   07 Dec 2010