Train longer, generalize better: closing the generalization gap in large batch training of neural networks
Elad Hoffer, Itay Hubara, Daniel Soudry
24 May 2017 · arXiv:1705.08741

Papers citing "Train longer, generalize better: closing the generalization gap in large batch training of neural networks"
Showing 50 of 465 citing papers (page 8 of 10)

An Empirical Study of Large-Batch Stochastic Gradient Descent with Structured Covariance Noise
Yeming Wen, Kevin Luk, Maxime Gazeau, Guodong Zhang, Harris Chan, Jimmy Ba
21 Feb 2019

Random Search and Reproducibility for Neural Architecture Search
Liam Li, Ameet Talwalkar
20 Feb 2019

Uniform convergence may be unable to explain generalization in deep learning
Vaishnavh Nagarajan, J. Zico Kolter
13 Feb 2019 · Neural Information Processing Systems (NeurIPS), 2019

Asymmetric Valleys: Beyond Sharp and Flat Local Minima
Haowei He, Gao Huang, Yang Yuan
02 Feb 2019 · Neural Information Processing Systems (NeurIPS), 2019

Compressing Gradient Optimizers via Count-Sketches
Ryan Spring, Anastasios Kyrillidis, Vijai Mohan, Anshumali Shrivastava
01 Feb 2019 · International Conference on Machine Learning (ICML), 2019

Augment your batch: better training with larger batches
Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, Daniel Soudry
27 Jan 2019

Traditional and Heavy-Tailed Self Regularization in Neural Network Models
Charles H. Martin, Michael W. Mahoney
24 Jan 2019

Large-Batch Training for LSTM and Beyond
Yang You, Jonathan Hseu, Chris Ying, J. Demmel, Kurt Keutzer, Cho-Jui Hsieh
24 Jan 2019

Measurements of Three-Level Hierarchical Structure in the Outliers in the Spectrum of Deepnet Hessians
Vardan Papyan
24 Jan 2019

A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks
Umut Simsekli, Levent Sagun, Mert Gurbuzbalaban
18 Jan 2019

Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks using PAC-Bayesian Analysis
Yusuke Tsuzuku, Issei Sato, Masashi Sugiyama
15 Jan 2019

CROSSBOW: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers
A. Koliousis, Pijika Watcharapichat, Matthias Weidlich, Kai Zou, Paolo Costa, Peter R. Pietzuch
08 Jan 2019

Generalization in Deep Networks: The Role of Distance from Initialization
Vaishnavh Nagarajan, J. Zico Kolter
07 Jan 2019

Scaling description of generalization with number of parameters in deep learning
Mario Geiger, Arthur Jacot, S. Spigler, Franck Gabriel, Levent Sagun, Stéphane d'Ascoli, Giulio Biroli, Clément Hongler, Matthieu Wyart
06 Jan 2019

A continuous-time analysis of distributed stochastic gradient
Nicholas M. Boffi, Jean-Jacques E. Slotine
28 Dec 2018

NIPS - Not Even Wrong? A Systematic Review of Empirically Complete Demonstrations of Algorithmic Effectiveness in the Machine Learning and Artificial Intelligence Literature
Franz J. Király, Bilal A. Mateen, R. Sonabend
18 Dec 2018

An Empirical Model of Large-Batch Training
Sam McCandlish, Jared Kaplan, Dario Amodei, OpenAI Dota Team
14 Dec 2018

Nonlinear Conjugate Gradients For Scaling Synchronous Distributed DNN Training
Saurabh N. Adya, Vinay Palakkode, Oncel Tuzel
07 Dec 2018

Towards Theoretical Understanding of Large Batch Training in Stochastic Gradient Descent
Xiaowu Dai, Yuhua Zhu
03 Dec 2018

Stochastic Training of Residual Networks: a Differential Equation Viewpoint
Qi Sun, Yunzhe Tao, Q. Du
01 Dec 2018

On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent
Noah Golmant, N. Vemuri, Z. Yao, Vladimir Feinberg, A. Gholami, Kai Rothauge, Michael W. Mahoney, Joseph E. Gonzalez
30 Nov 2018

LEARN Codes: Inventing Low-latency Codes via Recurrent Neural Networks
Yihan Jiang, Hyeji Kim, Himanshu Asnani, Sreeram Kannan, Sewoong Oh, Pramod Viswanath
30 Nov 2018

Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks
Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, Satoshi Matsuoka
29 Nov 2018

Deep learning for pedestrians: backpropagation in CNNs
L. Boué
29 Nov 2018

Neural Sign Language Translation based on Human Keypoint Estimation
Sang-Ki Ko, Chang Jo Kim, Hyedong Jung, Choongsang Cho
28 Nov 2018

Deep Frank-Wolfe For Neural Network Optimization
Leonard Berrada, Andrew Zisserman, M. P. Kumar
19 Nov 2018 · International Conference on Learning Representations (ICLR), 2018

Image Classification at Supercomputer Scale
Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, Youlong Cheng
16 Nov 2018

Massively Distributed SGD: ImageNet/ResNet-50 Training in a Flash
Hiroaki Mikami, Hisahiro Suganuma, Pongsakorn U-chupala, Yoshiki Tanaka, Yuichi Kageyama
13 Nov 2018

Measuring the Effects of Data Parallelism on Neural Network Training
Christopher J. Shallue, Jaehoon Lee, J. Antognini, J. Mamou, J. Ketterling, Yao Wang
08 Nov 2018 · Journal of Machine Learning Research (JMLR), 2018

A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation
Akhilesh Deepak Gotmare, N. Keskar, Caiming Xiong, R. Socher
29 Oct 2018

Three Mechanisms of Weight Decay Regularization
Guodong Zhang, Simon Mahns, Bowen Xu, Roger C. Grosse
29 Oct 2018

A jamming transition from under- to over-parametrization affects loss landscape and generalization
S. Spigler, Mario Geiger, Stéphane d'Ascoli, Levent Sagun, Giulio Biroli, Matthieu Wyart
22 Oct 2018

A Closer Look at Structured Pruning for Neural Network Compression
Elliot J. Crowley, Jack Turner, Amos Storkey, Michael F. P. O'Boyle
10 Oct 2018

Learning to Segment Inputs for NMT Favors Character-Level Processing
Julia Kreutzer, Artem Sokolov
02 Oct 2018

Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
Charles H. Martin, Michael W. Mahoney
02 Oct 2018

Large batch size training of neural networks with adversarial training and second-order information
Z. Yao, A. Gholami, Daiyaan Arfeen, Richard Liaw, Alfons Kemper, Kurt Keutzer, Michael W. Mahoney
02 Oct 2018

Directional Analysis of Stochastic Gradient Descent via von Mises-Fisher Distributions in Deep learning
Cheolhyoung Lee, Dong Wang, Wanmo Kang
29 Sep 2018

The jamming transition as a paradigm to understand the loss landscape of deep neural networks
Mario Geiger, S. Spigler, Stéphane d'Ascoli, Levent Sagun, Carlo Albert, Giulio Biroli, Matthieu Wyart
25 Sep 2018 · Physical Review E (PRE), 2018

Identifying Generalization Properties in Neural Networks
Huan Wang, N. Keskar, Caiming Xiong, R. Socher
19 Sep 2018

Removing the Feature Correlation Effect of Multiplicative Noise
Zijun Zhang, Yining Zhang, Zongpeng Li
19 Sep 2018

Don't Use Large Mini-Batches, Use Local SGD
Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi
22 Aug 2018

Large Scale Language Modeling: Converging on 40GB of Text in Four Hours
Raul Puri, Robert M. Kirby, Nikolai Yakovenko, Bryan Catanzaro
03 Aug 2018

Generalization Error in Deep Learning
Daniel Jakubovitz, Raja Giryes, M. Rodrigues
03 Aug 2018

A New Benchmark and Progress Toward Improved Weakly Supervised Learning
Jason Ramapuram, Russ Webb
30 Jun 2018 · British Machine Vision Conference (BMVC), 2018

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks
Jinghui Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, Quanquan Gu
18 Jun 2018

Full deep neural network training on a pruned weight budget
Maximilian Golub, G. Lemieux, Mieszko Lis
11 Jun 2018

The Effect of Network Width on the Performance of Large-batch Training
Lingjiao Chen, Hongyi Wang, Jinman Zhao, Dimitris Papailiopoulos, Paraschos Koutris
11 Jun 2018

Training Faster by Separating Modes of Variation in Batch-normalized Models
Mahdi M. Kalayeh, M. Shah
07 Jun 2018

Implicit regularization and solution uniqueness in over-parameterized matrix sensing
Kelly Geyer, Anastasios Kyrillidis, A. Kalev
06 Jun 2018

Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate
Mor Shpigel Nacson, Nathan Srebro, Daniel Soudry
05 Jun 2018