Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2212.14034
Cited By
Cramming: Training a Language Model on a Single GPU in One Day
28 December 2022
Jonas Geiping
Tom Goldstein
MoE
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Cramming: Training a Language Model on a Single GPU in One Day"
29 / 29 papers shown
Title
Efficient Self-Supervised Learning for Earth Observation via Dynamic Dataset Curation
Thomas Kerdreux
A. Tuel
Quentin Febvre
A. Mouche
Bertrand Chapron
73
0
0
09 Apr 2025
Hysteresis Activation Function for Efficient Inference
Moshe Kimhi
Idan Kashani
A. Mendelson
Chaim Baskin
LLMSV
21
0
0
15 Nov 2024
ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models
N. Jha
Brandon Reagen
OffRL
AI4CE
22
0
0
12 Oct 2024
Diffusion Guided Language Modeling
Justin Lovelace
Varsha Kishore
Yiwei Chen
Kilian Q. Weinberger
25
6
0
08 Aug 2024
An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes
Antonio Orvieto
Lin Xiao
32
2
0
05 Jul 2024
Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining
Qi Zhang
Tianqi Du
Haotian Huang
Yifei Wang
Yisen Wang
19
3
0
01 Jul 2024
Sustainable self-supervised learning for speech representations
Luis Lugo
Valentin Vielzeuf
26
2
0
11 Jun 2024
Symmetric Dot-Product Attention for Efficient Training of BERT Language Models
Martin Courtois
Malte Ostendorff
Leonhard Hennig
Georg Rehm
23
2
0
10 Jun 2024
Improving the Capabilities of Large Language Model Based Marketing Analytics Copilots With Semantic Search And Fine-Tuning
Yilin Gao
Arava Sai Kumar
Yancheng Li
James W. Snyder
AI4MH
27
2
0
16 Apr 2024
Setting the Record Straight on Transformer Oversmoothing
G. Dovonon
M. Bronstein
Matt J. Kusner
17
5
0
09 Jan 2024
Exponentially Faster Language Modelling
Peter Belcak
Roger Wattenhofer
6
2
0
15 Nov 2023
UMLS-KGI-BERT: Data-Centric Knowledge Integration in Transformers for Biomedical Entity Recognition
Aidan Mannion
Thierry Chevalier
D. Schwab
Lorraine Goeuriot
MedIm
32
3
0
20 Jul 2023
No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models
Jean Kaddour
Oscar Key
Piotr Nawrot
Pasquale Minervini
Matt J. Kusner
13
41
0
12 Jul 2023
The MiniPile Challenge for Data-Efficient Language Models
Jean Kaddour
MoE
ALM
8
41
0
17 Apr 2023
On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen
Yan Sun
Zhiyuan Yu
Liang Ding
Xinmei Tian
Dacheng Tao
VLM
19
39
0
07 Apr 2023
What Language Model to Train if You Have One Million GPU Hours?
Teven Le Scao
Thomas Wang
Daniel Hesslow
Lucile Saulnier
Stas Bekman
...
Lintang Sutawika
Jaesung Tae
Zheng-Xin Yong
Julien Launay
Iz Beltagy
MoE
AI4CE
212
103
0
27 Oct 2022
Simple Recurrence Improves Masked Language Models
Tao Lei
Ran Tian
Jasmijn Bastings
Ankur P. Parikh
36
4
0
23 May 2022
Transformer Quality in Linear Time
Weizhe Hua
Zihang Dai
Hanxiao Liu
Quoc V. Le
71
220
0
21 Feb 2022
Sharpness-Aware Minimization Improves Language Model Generalization
Dara Bahri
H. Mobahi
Yi Tay
119
82
0
16 Oct 2021
Primer: Searching for Efficient Transformers for Language Modeling
David R. So
Wojciech Mañke
Hanxiao Liu
Zihang Dai
Noam M. Shazeer
Quoc V. Le
VLM
83
149
0
17 Sep 2021
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
234
588
0
14 Jul 2021
ZeRO-Offload: Democratizing Billion-Scale Model Training
Jie Ren
Samyam Rajbhandari
Reza Yazdani Aminabadi
Olatunji Ruwase
Shuangyang Yang
Minjia Zhang
Dong Li
Yuxiong He
MoE
157
399
0
18 Jan 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
242
1,977
0
31 Dec 2020
Rethinking embedding coupling in pre-trained language models
Hyung Won Chung
Thibault Févry
Henry Tsai
Melvin Johnson
Sebastian Ruder
93
142
0
24 Oct 2020
On the Transformer Growth for Progressive BERT Training
Xiaotao Gu
Liyuan Liu
Hongkun Yu
Jing Li
C. L. P. Chen
Jiawei Han
VLM
61
49
0
23 Oct 2020
Big Bird: Transformers for Longer Sequences
Manzil Zaheer
Guru Guruganesh
Kumar Avinava Dubey
Joshua Ainslie
Chris Alberti
...
Philip Pham
Anirudh Ravula
Qifan Wang
Li Yang
Amr Ahmed
VLM
249
1,982
0
28 Jul 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
220
3,054
0
23 Jan 2020
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
294
6,927
0
20 Apr 2018
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu
M. Schuster
Z. Chen
Quoc V. Le
Mohammad Norouzi
...
Alex Rudnick
Oriol Vinyals
G. Corrado
Macduff Hughes
J. Dean
AIMat
716
6,435
0
26 Sep 2016
1