ResearchTrend.AI
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

1 April 2019
Yang You
Jing Li
Sashank J. Reddi
Jonathan Hseu
Sanjiv Kumar
Srinadh Bhojanapalli
Xiaodan Song
J. Demmel
Kurt Keutzer
Cho-Jui Hsieh
    ODL
arXiv:1904.00962

Papers citing "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes"

50 / 170 papers shown
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
Greg Yang
J. E. Hu
Igor Babuschkin
Szymon Sidor
Xiaodong Liu
David Farhi
Nick Ryder
J. Pachocki
Weizhu Chen
Jianfeng Gao
24
148
0
07 Mar 2022
SimKGC: Simple Contrastive Knowledge Graph Completion with Pre-trained Language Models
Liang Wang
Wei-Ye Zhao
Zhuoyu Wei
Jingming Liu
28
178
0
04 Mar 2022
FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours
Shenggan Cheng
Xuanlei Zhao
Guangyang Lu
Bin-Rui Li
Zhongming Yu
Tian Zheng
R. Wu
Xiwen Zhang
Jian Peng
Yang You
AI4CE
17
30
0
02 Mar 2022
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
Jiaxi Gu
Xiaojun Meng
Guansong Lu
Lu Hou
Minzhe Niu
...
Runhu Huang
Wei Zhang
Xingda Jiang
Chunjing Xu
Hang Xu
VLM
35
86
0
14 Feb 2022
Optimal Algorithms for Decentralized Stochastic Variational Inequalities
D. Kovalev
Aleksandr Beznosikov
Abdurakhmon Sadiev
Michael Persiianov
Peter Richtárik
Alexander Gasnikov
33
34
0
06 Feb 2022
Robust Training of Neural Networks Using Scale Invariant Architectures
Zhiyuan Li
Srinadh Bhojanapalli
Manzil Zaheer
Sashank J. Reddi
Sanjiv Kumar
19
27
0
02 Feb 2022
Deep Learning Methods for Abstract Visual Reasoning: A Survey on Raven's Progressive Matrices
Mikołaj Małkiński
Jacek Mańdziuk
117
41
0
28 Jan 2022
One Student Knows All Experts Know: From Sparse to Dense
Fuzhao Xue
Xiaoxin He
Xiaozhe Ren
Yuxuan Lou
Yang You
MoMe
MoE
24
20
0
26 Jan 2022
Towards Controllable Agent in MOBA Games with Generative Modeling
Shubao Zhang
30
0
0
15 Dec 2021
Injecting Semantic Concepts into End-to-End Image Captioning
Zhiyuan Fang
Jianfeng Wang
Xiaowei Hu
Lin Liang
Zhe Gan
Lijuan Wang
Yezhou Yang
Zicheng Liu
ViT
VLM
19
86
0
09 Dec 2021
Improving language models by retrieving from trillions of tokens
Sebastian Borgeaud
A. Mensch
Jordan Hoffmann
Trevor Cai
Eliza Rutherford
...
Simon Osindero
Karen Simonyan
Jack W. Rae
Erich Elsen
Laurent Sifre
KELM
RALM
33
1,013
0
08 Dec 2021
Boosting Discriminative Visual Representation Learning with Scenario-Agnostic Mixup
Siyuan Li
Zicheng Liu
Zedong Wang
Di Wu
Zihan Liu
Stan Z. Li
12
26
0
30 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
Lewei Yao
Runhu Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLM
CLIP
28
612
0
09 Nov 2021
Large-Scale Deep Learning Optimizations: A Comprehensive Survey
Xiaoxin He
Fuzhao Xue
Xiaozhe Ren
Yang You
22
14
0
01 Nov 2021
MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems
S. Farrell
M. Emani
J. Balma
L. Drescher
Aleksandr Drozd
...
Akihiro Tabuchi
V. Vishwanath
M. Wahib
Masafumi Yamazaki
Junqi Yin
VLM
21
35
0
21 Oct 2021
Dual Encoding U-Net for Spatio-Temporal Domain Shift Frame Prediction
Jay Santokhi
Dylan Hillier
Yiming Yang
Joned Sarwar
A. Jordán
Emil Hewage
AI4CE
20
1
0
21 Oct 2021
bert2BERT: Towards Reusable Pretrained Language Models
Cheng Chen
Yichun Yin
Lifeng Shang
Xin Jiang
Yujia Qin
Fengyu Wang
Zhi Wang
Xiao Chen
Zhiyuan Liu
Qun Liu
VLM
22
59
0
14 Oct 2021
Ab-Initio Potential Energy Surfaces by Pairing GNNs with Neural Wave Functions
Nicholas Gao
Stephan Günnemann
19
36
0
11 Oct 2021
Speeding up Deep Model Training by Sharing Weights and Then Unsharing
Shuo Yang
Le Hou
Xiaodan Song
Qiang Liu
Denny Zhou
110
9
0
08 Oct 2021
EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern Error Feedback
Ilyas Fatkhullin
Igor Sokolov
Eduard A. Gorbunov
Zhize Li
Peter Richtárik
44
44
0
07 Oct 2021
ResNet strikes back: An improved training procedure in timm
Ross Wightman
Hugo Touvron
Hervé Jégou
AI4TS
207
487
0
01 Oct 2021
AdaInject: Injection Based Adaptive Gradient Descent Optimizers for Convolutional Neural Networks
S. Dubey
S. H. Shabbeer Basha
S. Singh
B. B. Chaudhuri
ODL
35
9
0
26 Sep 2021
Learning the Physics of Particle Transport via Transformers
O. Pastor-Serrano
Zoltán Perkó
MedIm
13
13
0
08 Sep 2021
PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval
Ruiyang Ren
Shangwen Lv
Yingqi Qu
Jing Liu
Wayne Xin Zhao
Qiaoqiao She
Hua-Hong Wu
Haifeng Wang
Ji-Rong Wen
118
90
0
13 Aug 2021
Logit Attenuating Weight Normalization
Aman Gupta
R. Ramanath
Jun Shi
Anika Ramachandran
Sirou Zhou
Mingzhou Zhou
S. Keerthi
30
1
0
12 Aug 2021
Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters
Chen Sun
Shenggui Li
Jinyue Wang
Jun Yu
50
47
0
08 Aug 2021
Large-Scale Differentially Private BERT
Rohan Anil
Badih Ghazi
Vineet Gupta
Ravi Kumar
Pasin Manurangsi
22
131
0
03 Aug 2021
LICHEE: Improving Language Model Pre-training with Multi-grained Tokenization
Weidong Guo
Mingjun Zhao
Lusheng Zhang
Di Niu
Jinwen Luo
Zhenhua Liu
Zhenyang Li
J. Tang
13
8
0
02 Aug 2021
Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization
Chiyuan Zhang
M. Raghu
Jon M. Kleinberg
Samy Bengio
OOD
19
30
0
27 Jul 2021
Go Wider Instead of Deeper
Fuzhao Xue
Ziji Shi
Futao Wei
Yuxuan Lou
Yong Liu
Yang You
ViT
MoE
14
80
0
25 Jul 2021
Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
Shigang Li
Torsten Hoefler
GNN
AI4CE
LRM
77
131
0
14 Jul 2021
Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks
S. Shi
Lin Zhang
Bo-wen Li
24
9
0
14 Jul 2021
ResIST: Layer-Wise Decomposition of ResNets for Distributed Training
Chen Dun
Cameron R. Wolfe
C. Jermaine
Anastasios Kyrillidis
16
21
0
02 Jul 2021
What can linear interpolation of neural network loss landscapes tell us?
Tiffany J. Vlaar
Jonathan Frankle
MoMe
14
27
0
30 Jun 2021
Large-Scale Chemical Language Representations Capture Molecular Structure and Properties
Jerret Ross
Brian M. Belgodere
Vijil Chenthamarakshan
Inkit Padhi
Youssef Mroueh
Payel Das
AI4CE
19
271
0
17 Jun 2021
On Large-Cohort Training for Federated Learning
Zachary B. Charles
Zachary Garrett
Zhouyuan Huo
Sergei Shmulyian
Virginia Smith
FedML
16
112
0
15 Jun 2021
Pre-Trained Models: Past, Present and Future
Xu Han
Zhengyan Zhang
Ning Ding
Yuxian Gu
Xiao Liu
...
Jie Tang
Ji-Rong Wen
Jinhui Yuan
Wayne Xin Zhao
Jun Zhu
AIFin
MQ
AI4MH
24
811
0
14 Jun 2021
Federated Learning with Buffered Asynchronous Aggregation
John Nguyen
Kshitiz Malik
Hongyuan Zhan
Ashkan Yousefpour
Michael G. Rabbat
Mani Malek
Dzmitry Huba
FedML
11
287
0
11 Jun 2021
ResMLP: Feedforward networks for image classification with data-efficient training
Hugo Touvron
Piotr Bojanowski
Mathilde Caron
Matthieu Cord
Alaaeldin El-Nouby
...
Gautier Izacard
Armand Joulin
Gabriel Synnaeve
Jakob Verbeek
Hervé Jégou
VLM
16
654
0
07 May 2021
How to Train BERT with an Academic Budget
Peter Izsak
Moshe Berchansky
Omer Levy
12
111
0
15 Apr 2021
Large Batch Simulation for Deep Reinforcement Learning
Brennan Shacklett
Erik Wijmans
Aleksei Petrenko
Manolis Savva
Dhruv Batra
V. Koltun
Kayvon Fatahalian
3DV
OffRL
AI4CE
27
26
0
12 Mar 2021
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
J. Clark
Dan Garrette
Iulia Turc
John Wieting
25
210
0
11 Mar 2021
Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices
Max Ryabinin
Eduard A. Gorbunov
Vsevolod Plokhotnyuk
Gennady Pekhimenko
21
31
0
04 Mar 2021
Perceiver: General Perception with Iterative Attention
Andrew Jaegle
Felix Gimeno
Andrew Brock
Andrew Zisserman
Oriol Vinyals
João Carreira
VLM
ViT
MDE
48
973
0
04 Mar 2021
Lost in Pruning: The Effects of Pruning Neural Networks beyond Test Accuracy
Lucas Liebenwein
Cenk Baykal
Brandon Carter
David K Gifford
Daniela Rus
AAML
27
71
0
04 Mar 2021
MARINA: Faster Non-Convex Distributed Learning with Compression
Eduard A. Gorbunov
Konstantin Burlachenko
Zhize Li
Peter Richtárik
22
108
0
15 Feb 2021
Optimizing Inference Performance of Transformers on CPUs
D. Dice
Alex Kogan
19
15
0
12 Feb 2021
High-Performance Large-Scale Image Recognition Without Normalization
Andrew Brock
Soham De
Samuel L. Smith
Karen Simonyan
VLM
223
512
0
11 Feb 2021
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Lisa Anne Hendricks
John F. J. Mellor
R. Schneider
Jean-Baptiste Alayrac
Aida Nematzadeh
75
110
0
31 Jan 2021
AraGPT2: Pre-Trained Transformer for Arabic Language Generation
Wissam Antoun
Fady Baly
Hazem M. Hajj
VLM
14
103
0
31 Dec 2020