Train longer, generalize better: closing the generalization gap in large batch training of neural networks

24 May 2017
Elad Hoffer, Itay Hubara, Daniel Soudry
arXiv: 1705.08741

Papers citing "Train longer, generalize better: closing the generalization gap in large batch training of neural networks"

Showing 50 of 465 citing papers.
On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
Neural Information Processing Systems (NeurIPS), 2022
Sadhika Malladi, Kaifeng Lyu, A. Panigrahi, Sanjeev Arora
20 May 2022
Large Scale Transfer Learning for Differentially Private Image Classification
Harsh Mehta, Abhradeep Thakurta, Alexey Kurakin, Ashok Cutkosky
06 May 2022
Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for Full-Batch GD
International Conference on Learning Representations (ICLR), 2022
Konstantinos E. Nikolakakis, Farzin Haddadpour, Amin Karbasi, Dionysios S. Kalogerias
26 Apr 2022
CowClip: Reducing CTR Prediction Model Training Time from 12 hours to 10 minutes on 1 GPU
AAAI Conference on Artificial Intelligence (AAAI), 2022
Zangwei Zheng, Peng Xu, Xuan Zou, Da Tang, Zhen Li, ..., Xiangzhuo Ding, Fuzhao Xue, Ziheng Qing, Youlong Cheng, Yang You
13 Apr 2022
DistPro: Searching A Fast Knowledge Distillation Process via Meta Optimization
European Conference on Computer Vision (ECCV), 2022
XueQing Deng, Dawei Sun, Shawn D. Newsam, Peng Wang
12 Apr 2022
Deep learning, stochastic gradient descent and diffusion maps
Journal of Computational Mathematics and Data Science (JCMDS), 2022
Carmina Fjellström, Kaj Nyström
04 Apr 2022
Exploiting Explainable Metrics for Augmented SGD
Computer Vision and Pattern Recognition (CVPR), 2022
Mahdi S. Hosseini, Mathieu Tuli, Konstantinos N. Plataniotis
31 Mar 2022
Small Batch Sizes Improve Training of Low-Resource Neural MT
ICON, 2022
Àlex R. Atrio, Andrei Popescu-Belis
20 Mar 2022
Towards understanding deep learning with the natural clustering prior
Simon Carbonnelle
15 Mar 2022
On the Pitfalls of Batch Normalization for End-to-End Video Learning: A Study on Surgical Workflow Analysis
Dominik Rivoir, Isabel Funke, Stefanie Speidel
15 Mar 2022
Flat minima generalize for low-rank matrix recovery
Information and Inference: A Journal of the IMA, 2022
Lijun Ding, Dmitriy Drusvyatskiy, Maryam Fazel, Zaid Harchaoui
07 Mar 2022
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
Greg Yang, J. E. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, J. Pachocki, Weizhu Chen, Jianfeng Gao
07 Mar 2022
Regularising for invariance to data augmentation improves supervised learning
Aleksander Botev, Matthias Bauer, Soham De
07 Mar 2022
The Theoretical Expressiveness of Maxpooling
Kyle Matoba, Nikolaos Dimitriadis, François Fleuret
02 Mar 2022
Extended Unconstrained Features Model for Exploring Deep Neural Collapse
International Conference on Machine Learning (ICML), 2022
Tom Tirer, Joan Bruna
16 Feb 2022
Black-Box Generalization: Stability of Zeroth-Order Learning
Neural Information Processing Systems (NeurIPS), 2022
Konstantinos E. Nikolakakis, Farzin Haddadpour, Dionysios S. Kalogerias, Amin Karbasi
14 Feb 2022
Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers
Proceedings of the VLDB Endowment (PVLDB), 2022
Youjie Li, Amar Phanishayee, D. Murray, Jakub Tarnawski, Nam Sung Kim
02 Feb 2022
Memory-Efficient Backpropagation through Large Linear Layers
Daniel Bershatsky, A. Mikhalev, A. Katrutsa, Julia Gusak, D. Merkulov, Ivan Oseledets
31 Jan 2022
On the Power-Law Hessian Spectrums in Deep Learning
Zeke Xie, Qian-Yuan Tang, Yunfeng Cai, Mingming Sun, P. Li
31 Jan 2022
Rebalancing Batch Normalization for Exemplar-based Class-Incremental Learning
Computer Vision and Pattern Recognition (CVPR), 2022
Sungmin Cha, Sungjun Cho, Dasol Hwang, Sunwon Hong, Moontae Lee, Taesup Moon
29 Jan 2022
ScaLA: Accelerating Adaptation of Pre-Trained Transformer-Based Language Models via Efficient Large-Batch Adversarial Noise
Minjia Zhang, U. Niranjan, Yuxiong He
29 Jan 2022
Toward Training at ImageNet Scale with Differential Privacy
Alexey Kurakin, Shuang Song, Steve Chien, Roxana Geambasu, Seth Neel, Abhradeep Thakurta
28 Jan 2022
Existence and Estimation of Critical Batch Size for Training Generative Adversarial Networks with Two Time-Scale Update Rule
International Conference on Machine Learning (ICML), 2022
Naoki Sato, Hideaki Iiduka
28 Jan 2022
A Robust Initialization of Residual Blocks for Effective ResNet Training without Batch Normalization
IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2021
Enrico Civitelli, Alessio Sortino, Matteo Lapucci, Francesco Bagattini, G. Galvan
23 Dec 2021
Generalization Bounds for Stochastic Gradient Langevin Dynamics: A Unified View via Information Leakage Analysis
Bingzhe Wu, Zhicong Liang, Yatao Bian, Chaochao Chen, Junzhou Huang, Yuan Yao
14 Dec 2021
Non-Asymptotic Analysis of Online Multiplicative Stochastic Gradient Descent
Riddhiman Bhattacharya, Tiefeng Jiang
14 Dec 2021
DANets: Deep Abstract Networks for Tabular Data Classification and Regression
AAAI Conference on Artificial Intelligence (AAAI), 2021
Jintai Chen, Kuan-Yu Liao, Yao Wan, Benlin Liu, Jian Wu
06 Dec 2021
Loss Landscape Dependent Self-Adjusting Learning Rates in Decentralized Stochastic Gradient Descent
Wei Zhang, Mingrui Liu, Yu Feng, Xiaodong Cui, Brian Kingsbury, Yuhai Tu
02 Dec 2021
On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective
Xiaowu Dai, Yuhua Zhu
02 Dec 2021
Training BatchNorm Only in Neural Architecture Search and Beyond
Yichen Zhu, Jie Du, Yuqin Zhu, Yi Wang, Zhicai Ou, Feifei Feng, Jian Tang
01 Dec 2021
Hybrid BYOL-ViT: Efficient approach to deal with small datasets
Safwen Naimi, Rien van Leeuwen, W. Souidène, S. B. Saoud
08 Nov 2021
Exponential escape efficiency of SGD from sharp minima in non-stationary regime
Hikaru Ibayashi, Masaaki Imaizumi
07 Nov 2021
Large-Scale Deep Learning Optimizations: A Comprehensive Survey
Xiaoxin He, Fuzhao Xue, Xiaozhe Ren, Yang You
01 Nov 2021
Multilayer Lookahead: a Nested Version of Lookahead
Denys Pushkin, Luis Barba
27 Oct 2021
Trade-offs of Local SGD at Scale: An Empirical Study
Jose Javier Gonzalez Ortiz, Jonathan Frankle, Michael G. Rabbat, Ari S. Morcos, Nicolas Ballas
15 Oct 2021
What Happens after SGD Reaches Zero Loss? --A Mathematical Framework
Zhiyuan Li, Tianhao Wang, Sanjeev Arora
13 Oct 2021
Spectral Bias in Practice: The Role of Function Frequency in Generalization
Sara Fridovich-Keil, Raphael Gontijo-Lopes, Rebecca Roelofs
06 Oct 2021
Stochastic Training is Not Necessary for Generalization
Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein
29 Sep 2021
How to Inject Backdoors with Better Consistency: Logit Anchoring on Clean Data
Zhiyuan Zhang, Lingjuan Lyu, Weiqiang Wang, Lichao Sun, Xu Sun
03 Sep 2021
Shift-Curvature, SGD, and Generalization
Arwen V. Bradley, C. Gomez-Uribe, Manish Reddy Vuyyuru
21 Aug 2021
Logit Attenuating Weight Normalization
Aman Gupta, R. Ramanath, Jun Shi, Anika Ramachandran, Sirou Zhou, Mingzhou Zhou, S. Keerthi
12 Aug 2021
Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters
International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2021
Chen Sun, Shenggui Li, Jinyue Wang, Jun Yu
08 Aug 2021
Simple Modifications to Improve Tabular Neural Networks
J. Fiedler
06 Aug 2021
SGD with a Constant Large Learning Rate Can Converge to Local Maxima
Liu Ziyin, Botao Li, James B. Simon, Masakuni Ueda
25 Jul 2021
The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion
Neural Computation (Neural Comput.), 2021
D. Kunin, Javier Sagastuy-Breña, Lauren Gillespie, Eshed Margalit, Hidenori Tanaka, Surya Ganguli, Daniel L. K. Yamins
19 Jul 2021
OODformer: Out-Of-Distribution Detection Transformer
British Machine Vision Conference (BMVC), 2021
Rajat Koner, Poulami Sinhamahapatra, Karsten Roscher, Stephan Günnemann, Volker Tresp
19 Jul 2021
Globally Convergent Multilevel Training of Deep Residual Networks
Alena Kopanicáková, Rolf Krause
15 Jul 2021
Automated Learning Rate Scheduler for Large-batch Training
Chiheon Kim, Saehoon Kim, Jongmin Kim, Donghoon Lee, Sungwoong Kim
13 Jul 2021
Bag of Tricks for Neural Architecture Search
T. Elsken, B. Staffler, Arber Zela, J. H. Metzen, Katharina Eggensperger
08 Jul 2021
Never Go Full Batch (in Stochastic Convex Optimization)
Neural Information Processing Systems (NeurIPS), 2021
I Zaghloul Amir, Y. Carmon, Tomer Koren, Roi Livni
29 Jun 2021