ResearchTrend.AI

© 2025 ResearchTrend.AI, All rights reserved.

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
arXiv:1609.04836 (v2, latest)

15 September 2016
N. Keskar
Dheevatsa Mudigere
J. Nocedal
M. Smelyanskiy
P. T. P. Tang
    ODL

Papers citing "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima"

50 / 1,585 papers shown
VASSO: Variance Suppression for Sharpness-Aware Minimization
Bingcong Li
Yilang Zhang
G. Giannakis
12
0
0
02 Sep 2025
Adaptive Heavy-Tailed Stochastic Gradient Descent
Bodu Gong
Gustavo Enrique Batista
Pierre Lafaye de Micheaux
0
0
0
29 Aug 2025
MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training
Yang Luo
Zangwei Zheng
Ziheng Qin
Zirui Zhu
Yong Liu
Yang You
ALM
0
0
0
28 Aug 2025
Flatness-aware Curriculum Learning via Adversarial Difficulty
Hiroaki Aizawa
Yoshikazu Hayashi
ODL
36
0
0
26 Aug 2025
C-Flat++: Towards a More Efficient and Powerful Framework for Continual Learning
Wei Li
Hangjie Yuan
Zixiang Zhao
Yifan Zhu
Aojun Lu
Tao Feng
Yanan Sun
12
0
0
26 Aug 2025
Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
Taishi Nakamura
Satoki Ishikawa
Masaki Kawamura
Takumi Okamoto
Daisuke Nohara
Jun Suzuki
Rio Yokota
MoE
LRM
8
0
0
26 Aug 2025
Algebraic Approach to Ridge-Regularized Mean Squared Error Minimization in Minimal ReLU Neural Network
Ryoya Fukasaku
Y. Kabata
Akifumi Okuno
8
0
0
25 Aug 2025
Convergence and Generalization of Anti-Regularization for Parametric Models
Dongseok Kim
Wonjun Jeong
Gisung Oh
16
0
0
24 Aug 2025
Balanced Sharpness-Aware Minimization for Imbalanced Regression
Yahao Liu
Qin Wang
Lixin Duan
Wen Li
8
0
0
23 Aug 2025
WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling
Jiacheng Li
Jianchao Tan
Zhidong Yang
Pingwei Sun
Feiye Huo
...
Xiangyu Zhang
Maoxin He
Guangming Tan
Weile Jia
Tong Zhao
8
0
0
21 Aug 2025
Wormhole Dynamics in Deep Neural Networks
Yen-Lung Lai
Zhe Jin
AI4CE
24
0
0
20 Aug 2025
Inter-Class Relational Loss for Small Object Detection: A Case Study on License Plates
Dian Ning
Dong Seog Han
28
0
0
20 Aug 2025
Twin-Boot: Uncertainty-Aware Optimization via Online Two-Sample Bootstrapping
Carlos Stein Brito
UQCV
44
0
0
20 Aug 2025
Fisher-Orthogonal Projection Methods for Natural Gradient Descent with Large Batches
Yishun Lu
Wesley Armour
ODL
112
0
0
19 Aug 2025
Optimal Condition for Initialization Variance in Deep Neural Networks: An SGD Dynamics Perspective
Hiroshi Horii
Sothea Has
16
0
0
18 Aug 2025
Training Machine Learning Models on Human Spatio-temporal Mobility Data: An Experimental Study [Experiment Paper]
Yueyang Liu
Lance Kennedy
Ruochen Kong
Joon-Seok Kim
Andreas Züfle
24
0
0
18 Aug 2025
Multi-level Collaborative Distillation Meets Global Workspace Model: A Unified Framework for OCIL
Shibin Su
Guoqiang Liang
De Cheng
Shizhou Zhang
Lingyan Ran
Yanning Zhang
CLL
24
0
0
12 Aug 2025
Statistical Theory of Multi-stage Newton Iteration Algorithm for Online Continual Learning
Xinjia Lu
Chuhan Wang
Qian Zhao
Lixing Zhu
Xuehu Zhu
16
0
0
10 Aug 2025
Tractable Sharpness-Aware Learning of Probabilistic Circuits
Hrithik Suresh
Sahil Sidheekh
Vishnu Shreeram M.P
S. Natarajan
N. C. Krishnan
TPM
28
0
0
07 Aug 2025
Sensitivity of Stability: Theoretical & Empirical Analysis of Replicability for Adaptive Data Selection in Transfer Learning
Prabhav Singh
Jessica Sorrell
24
0
0
06 Aug 2025
Superior resilience to poisoning and amenability to unlearning in quantum machine learning
Yu-Qin Chen
Shi-Xin Zhang
AAML
32
1
0
04 Aug 2025
EFlat-LoRA: Efficiently Seeking Flat Minima for Better Generalization in Fine-Tuning Large Language Models and Beyond
Jiaxin Deng
Qingcheng Zhu
Junbiao Pang
Linlin Yang
Zhongqian Fu
Baochang Zhang
37
0
0
01 Aug 2025
Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning
Tolga Dimlioglu
A. Choromańska
FedML
36
0
0
27 Jul 2025
Irredundant $k$-Fold Cross-Validation
Jesus S. Aguilar-Ruiz
31
0
0
26 Jul 2025
The Price equation reveals a universal force-metric-bias law of algorithmic learning and natural selection
Steven A. Frank
FedML
79
0
0
24 Jul 2025
Large Learning Rates Simultaneously Achieve Robustness to Spurious Correlations and Compressibility
Melih Barsbey
Lucas Prieto
Stefanos Zafeiriou
Tolga Birdal
56
0
0
23 Jul 2025
Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful
Martin Marek
Sanae Lotfi
Aditya Somasundaram
A. Wilson
Micah Goldblum
LRM
34
0
0
09 Jul 2025
DGSAM: Domain Generalization via Individual Sharpness-Aware Minimization
Youngjun Song
Youngsik Hwang
Jonghun Lee
Heechang Lee
Dong-Young Lim
AAML
159
0
0
01 Jul 2025
Single-shot thermometry of simulated Bose–Einstein condensates using artificial intelligence
Jack Griffiths
Steven A. Wrathmall
Simon A. Gardiner
57
0
0
20 Jun 2025
The Butterfly Effect: Neural Network Training Trajectories Are Highly Sensitive to Initial Conditions
Devin Kwok
Gül Sena Altıntaş
Colin Raffel
David Rolnick
104
0
0
16 Jun 2025
From Sharpness to Better Generalization for Speech Deepfake Detection
Wen-Chin Huang
Xuechen Liu
Xin Eric Wang
Junichi Yamagishi
Yanmin Qian
77
0
0
13 Jun 2025
Generalization Bound of Gradient Flow through Training Trajectory and Data-dependent Kernel
Yilan Chen
Zhichao Wang
Wei Huang
Andi Han
Taiji Suzuki
Arya Mazumdar
MLT
77
0
0
12 Jun 2025
FEDTAIL: Federated Long-Tailed Domain Generalization with Sharpness-Guided Gradient Matching
Sunny Gupta
Nikita Jangid
Shounak Das
Amit Sethi
FedML
80
0
0
10 Jun 2025
Promoting Ensemble Diversity with Interactive Bayesian Distributional Robustness for Fine-tuning Foundation Models
Ngoc-Quan Pham
Tuan Truong
Quyen Tran
T. H. Nguyen
Dinh Q. Phung
T. Le
102
1
0
08 Jun 2025
SAFE: Finding Sparse and Flat Minima to Improve Pruning
Dongyeop Lee
Kwanhee Lee
Jinseok Chung
Namhoon Lee
99
0
0
07 Jun 2025
Towards Better Generalization via Distributional Input Projection Network
Yifan Hao
Yanxin Lu
Xinwei Shen
Tong Zhang
143
0
0
05 Jun 2025
Temporal horizons in forecasting: a performance-learnability trade-off
Pau Vilimelis Aceituno
Jack William Miller
Noah Marti
Youssef Farag
Victor Boussange
AI4TS
184
0
0
04 Jun 2025
scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics
Davide D'Ascenzo
Sebastiano Cultrera di Montesano
138
0
0
02 Jun 2025
GradPower: Powering Gradients for Faster Language Model Pre-Training
Mingze Wang
Jinbo Wang
Jiaqi Zhang
Wei Wang
Peng Pei
Xunliang Cai
Weinan E
Lei Wu
109
0
0
30 May 2025
LightSAM: Parameter-Agnostic Sharpness-Aware Minimization
Yifei Cheng
Li Shen
Hao Sun
Nan Yin
Xiaochun Cao
Enhong Chen
AAML
86
0
0
30 May 2025
Towards Understanding The Calibration Benefits of Sharpness-Aware Minimization
C. Tan
Yubo Zhou
Haishan Ye
Guang Dai
Junmin Liu
Zengjie Song
Jiangshe Zhang
Zixiang Zhao
Yunda Hao
Yong Xu
AAML
92
0
0
29 May 2025
Dynamic Spectral Backpropagation for Efficient Neural Network Training
Mannmohan Muthuraman
198
0
0
29 May 2025
One-Time Soft Alignment Enables Resilient Learning without Weight Transport
Jeonghwan Cheon
Jaehyuk Bae
Se-Bum Paik
ODL
117
2
0
27 May 2025
Skip-Thinking: Chunk-wise Chain-of-Thought Distillation Enable Smaller Language Models to Reason Better and Faster
Xiao Chen
Sihang Zhou
K. Liang
Xiaoyu Sun
Xinwang Liu
LRM
91
3
0
24 May 2025
Convergence, Sticking and Escape: Stochastic Dynamics Near Critical Points in SGD
Dmitry Dudukalov
Artem Logachov
Vladimir Lotov
Timofei Prasolov
Evgeny Prokopenko
Anton Tarasenko
101
0
0
24 May 2025
TRACE for Tracking the Emergence of Semantic Representations in Transformers
Nura Aljaafari
Danilo S. Carvalho
André Freitas
124
0
0
23 May 2025
Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
Punya Syon Pandey
Samuel Simko
Kellin Pelrine
Zhijing Jin
AAML
108
2
0
22 May 2025
Revealing Language Model Trajectories via Kullback-Leibler Divergence
Ryo Kishino
Yusuke Takase
Momose Oyama
Hiroaki Yamagiwa
Hidetoshi Shimodaira
129
0
0
21 May 2025
DeepKD: A Deeply Decoupled and Denoised Knowledge Distillation Trainer
Haiduo Huang
Jiangcheng Song
Yadong Zhang
Pengju Ren
112
0
0
21 May 2025
Intra-class Patch Swap for Self-Distillation
Hongjun Choi
Eun Som Jeon
Ankita Shukla
Pavan Turaga
99
0
0
20 May 2025