arXiv:1705.08741
Train longer, generalize better: closing the generalization gap in large batch training of neural networks
24 May 2017
Elad Hoffer
Itay Hubara
Daniel Soudry
ODL
Papers citing "Train longer, generalize better: closing the generalization gap in large batch training of neural networks" (showing 50 of 465)
nnMIL: A generalizable multiple instance learning framework for computational pathology
Xiangde Luo
Jinxi Xiang
Yuanfeng Ji
Ruijiang Li
LM&MA
18 Nov 2025
Sharp Minima Can Generalize: A Loss Landscape Perspective On Data
Raymond Fan
Bryce Sandlund
Lin Myat Ko
06 Nov 2025
IBNorm: Information-Bottleneck Inspired Normalization for Representation Learning
Xiandong Zou
Pan Zhou
29 Oct 2025
Position: Many generalization measures for deep learning are fragile
Shuofeng Zhang
A. Louis
AAML
21 Oct 2025
Stochastic Difference-of-Convex Optimization with Momentum
El Mahdi Chayti
Martin Jaggi
20 Oct 2025
DYNAMIX: RL-based Adaptive Batch Size Optimization in Distributed Machine Learning Systems
Yuanjun Dai
Keqiang He
An Wang
09 Oct 2025
Graph Coloring for Multi-Task Learning
Santosh Patapati
21 Sep 2025
On Using Large-Batches in Federated Learning
Sahil Tyagi
FedML
05 Sep 2025
Optimal Condition for Initialization Variance in Deep Neural Networks: An SGD Dynamics Perspective
Hiroshi Horii
Sothea Has
18 Aug 2025
Both Asymptotic and Non-Asymptotic Convergence of Quasi-Hyperbolic Momentum using Increasing Batch Size
Kento Imaizumi
Hideaki Iiduka
30 Jun 2025
NysAct: A Scalable Preconditioned Gradient Descent using Nystrom Approximation
BigData Congress [Services Society] (BSS), 2024
Hyunseok Seung
Jaewoo Lee
Hyunsuk Ko
ODL
10 Jun 2025
Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
Shane Bergsma
Nolan Dey
Gurpreet Gosal
Gavia Gray
Daria Soboleva
Joel Hestness
19 May 2025
Positional Fragility in LLMs: How Offset Effects Reshape Our Understanding of Memorization Risks
Yixuan Xu
Antoni-Joan Solergibert i Llaquet
Antoine Bosselut
Imanol Schlag
19 May 2025
Gradient Descent as a Shrinkage Operator for Spectral Bias
Simon Lucey
25 Apr 2025
Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent
Max Hennick
Stijn De Baerdemacker
28 Mar 2025
OmniLearn: A Framework for Distributed Deep Learning over Heterogeneous Clusters
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2025
S. Tyagi
Prateek Sharma
21 Mar 2025
A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules
International Conference on Learning Representations (ICLR), 2025
Kairong Luo
Haodong Wen
Shengding Hu
Zhenbo Sun
Zhiyuan Liu
Maosong Sun
Kaifeng Lyu
Wenguang Chen
CLL
17 Mar 2025
A new local time-decoupled squared Wasserstein-2 method for training stochastic neural networks to reconstruct uncertain parameters in dynamical systems
Neural Networks (NN), 2025
Mingtao Xia
Qijing Shen
Philip Maini
Eamonn Gaffney
Alex Mogilner
07 Mar 2025
Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
International Conference on Learning Representations (ICLR), 2025
Shane Bergsma
Nolan Dey
Gurpreet Gosal
Gavia Gray
Daria Soboleva
Joel Hestness
21 Feb 2025
On the use of neural networks for the structural characterization of polymeric porous materials
Jorge Torre
Suset Barroso-Solares
M.A. Rodríguez-Pérez
Javier Pinto
25 Jan 2025
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
Tim Tsz-Kit Lau
Weijian Li
Chenwei Xu
Han Liu
Mladen Kolar
30 Dec 2024
Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities
Lawrence Wang
Stephen J. Roberts
23 Dec 2024
Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs
International Conference on Learning Representations (ICLR), 2024
Aldo Pareja
Nikhil Shivakumar Nayak
Hao Wang
Krishnateja Killamsetty
Shivchander Sudalairaj
...
Guangxuan Xu
Kai Xu
Ligong Han
Luke Inglis
Akash Srivastava
17 Dec 2024
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
Zesen Cheng
Hang Zhang
Kehan Li
Sicong Leng
Zhiqiang Hu
Fei Wu
Deli Zhao
Xin Li
Lidong Bing
22 Oct 2024
PLDR-LLM: Large Language Model from Power Law Decoder Representations
Burc Gokden
22 Oct 2024
Evolutionary Retrofitting
ACM Transactions on Evolutionary Learning and Optimization (ACM TELO), 2024
Mathurin Videau
M. Zameshina
Alessandro Leite
Laurent Najman
Marc Schoenauer
O. Teytaud
15 Oct 2024
Convergence of Sharpness-Aware Minimization Algorithms using Increasing Batch Size and Decaying Learning Rate
Hinata Harada
Hideaki Iiduka
16 Sep 2024
Enhancing Large Language Model-based Speech Recognition by Contextualization for Rare and Ambiguous Words
Kento Nozawa
Takashi Masuko
Toru Taniguchi
15 Aug 2024
Safe Semi-Supervised Contrastive Learning Using In-Distribution Data as Positive Examples
IEEE Access (IEEE Access), 2024
Mingu Kwak
Hyungu Kahng
Seoung Bum Kim
03 Aug 2024
Characterizing Dynamical Stability of Stochastic Gradient Descent in Overparameterized Learning
Dennis Chemnitz
Maximilian Engel
29 Jul 2024
Bias of Stochastic Gradient Descent or the Architecture: Disentangling the Effects of Overparameterization of Neural Networks
Amit Peleg
Matthias Hein
04 Jul 2024
Preserving Multilingual Quality While Tuning Query Encoder on English Only
Oleg V. Vasilyev
Randy Sawaya
John Bohannon
01 Jul 2024
Effect of Random Learning Rate: Theoretical Analysis of SGD Dynamics in Non-Convex Optimization via Stationary Distribution
Naoki Yoshida
Shogo H. Nakakita
Masaaki Imaizumi
23 Jun 2024
Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods
Tim Tsz-Kit Lau
Weijian Li
Chenwei Xu
Han Liu
Mladen Kolar
20 Jun 2024
Is Your HD Map Constructor Reliable under Sensor Corruptions?
Xiaoshuai Hao
Mengchuan Wei
Yifan Yang
Haimei Zhao
Hui Zhang
Yi Zhou
Qiang Wang
Weiming Li
Lingdong Kong
Jing Zhang
3DV
18 Jun 2024
Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization
Yuhang Cai
Jingfeng Wu
Song Mei
Michael Lindsey
Peter L. Bartlett
12 Jun 2024
Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices
Ruiyang Qin
Dancheng Liu
Zheyu Yan
Zhaoxuan Tan
Zixuan Pan
Zhenge Jia
Meng Jiang
Ahmed Abbasi
Jinjun Xiong
Yiyu Shi
06 Jun 2024
Communication-Efficient Distributed Deep Learning via Federated Dynamic Averaging
Michail Theologitis
Georgios Frangias
Georgios Anestis
V. Samoladas
Antonios Deligiannakis
FedML
31 May 2024
Improving Generalization and Convergence by Enhancing Implicit Regularization
Mingze Wang
Haotian He
Jinbo Wang
Zilin Wang
Guanhua Huang
Feiyu Xiong
Zhiyu Li
E. Weinan
Lei Wu
31 May 2024
Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models
Yubin Shi
Yixuan Chen
Mingzhi Dong
Xiaochen Yang
Dongsheng Li
...
Yingying Zhao
Fan Yang
Tun Lu
Ning Gu
L. Shang
MoMe
13 May 2024
PackVFL: Efficient HE Packing for Vertical Federated Learning
Liu Yang
Shuowei Cai
Di Chai
Junxue Zhang
Han Tian
Yilun Jin
Kun Guo
Kai Chen
Qiang Yang
FedML
01 May 2024
Singular-limit analysis of gradient descent with noise injection
Anna Shalova
André Schlichting
M. Peletier
18 Apr 2024
Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology
Oren Z. Kraus
Kian Kenyon-Dean
Saber Saberian
Maryam Fallah
Peter McLean
...
Chi Vicky Cheng
Kristen Morse
Maureen Makes
Ben Mabey
Berton Earnshaw
16 Apr 2024
Dynamical stability and chaos in artificial neural network trajectories along training
Kaloyan Danovski
Miguel C. Soriano
Lucas Lacasa
08 Apr 2024
Learning to Deliver: a Foundation Model for the Montreal Capacitated Vehicle Routing Problem
Samuel J. K. Chin
Matthias Winkenbach
Akash Srivastava
28 Feb 2024
Principled Architecture-aware Scaling of Hyperparameters
Wuyang Chen
Junru Wu
Zhangyang Wang
Boris Hanin
AI4CE
27 Feb 2024
Investigating the Histogram Loss in Regression
Ehsan Imani
Kai Luedemann
Sam Scholnick-Hughes
Esraa Elelimy
Martha White
UQCV
20 Feb 2024
AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods
Tim Tsz-Kit Lau
Han Liu
Mladen Kolar
ODL
17 Feb 2024
Understanding the Generalization Benefits of Late Learning Rate Decay
International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
Yinuo Ren
Chao Ma
Lexing Ying
AI4CE
21 Jan 2024
AdamL: A fast adaptive gradient method incorporating loss function
Lu Xia
Stefano Massei
ODL
23 Dec 2023