How to Train BERT with an Academic Budget
Peter Izsak, Moshe Berchansky, Omer Levy
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
arXiv:2104.07705 (v2), 15 April 2021
Links: ArXiv (abs) · PDF · HTML · HuggingFace

Papers citing "How to Train BERT with an Academic Budget" (showing 50 of 71)

What is the Best Sequence Length for BABYLM?
Suchir Salhan, Richard Diehl Martinez, Zébulon Goriely, P. Buttery. 22 Oct 2025.

Patent Language Model Pretraining with ModernBERT
Amirhossein Yousefiramandi, Ciaran Cooney. 18 Sep 2025.

Stepsize anything: A unified learning rate schedule for budgeted-iteration training
Anda Tang, Yiming Dong, Yutao Zeng, Xun Zhou, Zhouchen Lin. 30 May 2025.

MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation
Maike Behrendt, Stefan Sylvius Wagner, Stefan Harmeling. 21 May 2025.

Prediction-powered estimators for finite population statistics in highly imbalanced textual data: Public hate crime estimation
Hannes Waldetoft, Jakob Torgander, Måns Magnusson. 05 May 2025.

Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Alex Warstadt, Aaron Mueller, Leshem Choshen, E. Wilcox, Chengxu Zhuang, …, Rafael Mosquera, Bhargavi Paranjape, Adina Williams, Tal Linzen, Robert Bamler. 10 Apr 2025.

Slamming: Training a Speech Language Model on One GPU in a Day
Gallil Maimon, Avishai Elmakies, Yossi Adi. Annual Meeting of the Association for Computational Linguistics (ACL), 2025. 19 Feb 2025.

A distributional simplicity bias in the learning dynamics of transformers
Riccardo Rende, Federica Gerace, Alessandro Laio, Sebastian Goldt. Neural Information Processing Systems (NeurIPS), 2024. 17 Feb 2025.

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, …, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli. 18 Dec 2024.

Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
Gavia Gray, Aman Tiwari, Shane Bergsma, Joel Hestness. Neural Information Processing Systems (NeurIPS), 2024. 01 Nov 2024.

$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources
Apoorv Khandelwal, Tian Yun, Nihal V. Nayak, Jack Merullo, Stephen H. Bach, Chen Sun, Ellie Pavlick. 30 Oct 2024.

Team Ryu's Submission to SIGMORPHON 2024 Shared Task on Subword Tokenization
Zilong Li. 19 Oct 2024.

Exploring the Benefit of Activation Sparsity in Pre-training
Zhengyan Zhang, Chaojun Xiao, Qiujieli Qin, Yankai Lin, Zhiyuan Zeng, Xu Han, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Jie Zhou. International Conference on Machine Learning (ICML), 2024. 04 Oct 2024.

Expanding Expressivity in Transformer Models with MöbiusAttention
Anna-Maria Halacheva, M. Nayyeri, Steffen Staab. 08 Sep 2024.

Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers
Sukjun Hwang, Aakash Lahoti, Tri Dao, Albert Gu. 13 Jul 2024.

Aligning Programming Language and Natural Language: Exploring Design Choices in Multi-Modal Transformer-Based Embedding for Bug Localization
Partha Chakraborty, Venkatraman Arumugam, M. Nagappan. 25 Jun 2024.

Knowledge Distillation vs. Pretraining from Scratch under a Fixed (Computation) Budget
Minh Duc Bui, Fabian David Schmidt, Goran Glavaš, Katharina von der Wense. 30 Apr 2024.

PeLLE: Encoder-based language models for Brazilian Portuguese based on open data
Guilherme Lamartine de Mello, Marcelo Finger, F. Serras, M. Carpi, Marcos Menon Jose, Pedro Henrique Domingues, Paulo Cavalim. 29 Feb 2024.

Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling
Mahdi Karami, Ali Ghodsi. 28 Feb 2024.

The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning
Nik Vaessen, David A. van Leeuwen. 21 Feb 2024.

The Compute Divide in Machine Learning: A Threat to Academic Contribution and Scrutiny?
T. Besiroglu, S. Bergerson, Amelia Michael, Lennart Heim, Xueyun Luo, Neil Thompson. 04 Jan 2024.

MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining
Jacob P. Portes, Alex Trott, Sam Havens, Daniel King, Abhinav Venigalla, Moin Nadeem, Nikhil Sardana, D. Khudia, Jonathan Frankle. Neural Information Processing Systems (NeurIPS), 2023. 29 Dec 2023.

Spike No More: Stabilizing the Pre-training of Large Language Models
Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki. 28 Dec 2023.

Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise
Rui Pan, Yuxing Liu, Xiaoyu Wang, Tong Zhang. 22 Dec 2023.

CLIMB: Curriculum Learning for Infant-inspired Model Building
Richard Diehl Martinez, Zébulon Goriely, Hope McGovern, Christopher Davis, Andrew Caines, P. Buttery, Lisa Beinborn. 15 Nov 2023.

Explicit Morphological Knowledge Improves Pre-training of Language Models for Hebrew
Eylon Gueta, Omer Goldman, Reut Tsarfaty. 01 Nov 2023.

A Quadratic Synchronization Rule for Distributed Deep Learning
Xinran Gu, Kaifeng Lyu, Sanjeev Arora, Jingzhao Zhang, Longbo Huang. International Conference on Learning Representations (ICLR), 2024. 22 Oct 2023.

A Simple and Robust Framework for Cross-Modality Medical Image Segmentation applied to Vision Transformers
Matteo Bastico, David Ryckelynck, Laurent Corté, Yannick Tillier, Etienne Decencière. 09 Oct 2023.

M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization
Che Liu, Sibo Cheng, Chong Chen, Mengyun Qiao, Weitong Zhang, Anand Shah, Wenjia Bai, Rossella Arcucci. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2023. 17 Jul 2023.

No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models
Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, Matt J. Kusner. Neural Information Processing Systems (NeurIPS), 2023. 12 Jul 2023.

Biomedical Language Models are Robust to Sub-optimal Tokenization
Bernal Jiménez Gutiérrez, Huan Sun, Yu Su. Workshop on Biomedical Natural Language Processing (BioNLP), 2023. 30 Jun 2023.

Surveying (Dis)Parities and Concerns of Compute Hungry NLP Research
Ji-Ung Lee, Haritz Puerto, Betty van Aken, Yuki Arase, Jessica Zosa Forde, …, Andreas Rucklé, Iryna Gurevych, Roy Schwartz, Emma Strubell, Jesse Dodge. 29 Jun 2023.

Lost in Translation: Large Language Models in Non-English Content Analysis
Gabriel Nicholas, Aliya Bhatia. 12 Jun 2023.

Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts
Ganesh Jawahar, Haichuan Yang, Yunyang Xiong, Zechun Liu, Dilin Wang, …, Barlas Oğuz, Muhammad Abdul-Mageed, L. Lakshmanan, Raghuraman Krishnamoorthi, Vikas Chandra. Annual Meeting of the Association for Computational Linguistics (ACL), 2023. 08 Jun 2023.

Data-Efficient French Language Modeling with CamemBERTa
Wissam Antoun, Benoît Sagot, Djamé Seddah. Annual Meeting of the Association for Computational Linguistics (ACL), 2023. 02 Jun 2023.

How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives
Xinpeng Wang, Leonie Weissweiler, Hinrich Schütze, Barbara Plank. Annual Meeting of the Association for Computational Linguistics (ACL), 2023. 24 May 2023.

Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, Tengyu Ma. International Conference on Learning Representations (ICLR), 2024. 23 May 2023.

Cuttlefish: Low-Rank Model Training without All the Tuning
Hongyi Wang, Saurabh Agarwal, Pongsakorn U-chupala, Yoshiki Tanaka, Eric P. Xing, Dimitris Papailiopoulos. Conference on Machine Learning and Systems (MLSys), 2023. 04 May 2023.

Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models
Junmo Kang, Wei Xu, Alan Ritter. Annual Meeting of the Association for Computational Linguistics (ACL), 2023. 02 May 2023.

The MiniPile Challenge for Data-Efficient Language Models
Jean Kaddour. 17 Apr 2023.

On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen, Yan Sun, Zhiyuan Yu, Liang Ding, Xinmei Tian, Dacheng Tao. 07 Apr 2023.

Do Transformers Parse while Predicting the Masked Word?
Haoyu Zhao, A. Panigrahi, Rong Ge, Sanjeev Arora. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. 14 Mar 2023.

The Framework Tax: Disparities Between Inference Efficiency in NLP Research and Deployment
Jared Fernandez, Jacob Kahn, Clara Na, Yonatan Bisk, Emma Strubell. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. 13 Feb 2023.

Data Selection for Language Models via Importance Resampling
Sang Michael Xie, Shibani Santurkar, Tengyu Ma, Percy Liang. Neural Information Processing Systems (NeurIPS), 2023. 06 Feb 2023.

Which Model Shall I Choose? Cost/Quality Trade-offs for Text Classification Tasks
Shi Zong, Joshua Seltzer, Jia Pan, Kathy Cheng, Jimmy J. Lin. 17 Jan 2023.

NarrowBERT: Accelerating Masked Language Model Pretraining and Inference
Haoxin Li, Phillip Keung, Daniel Cheng, Jungo Kasai, Noah A. Smith. Annual Meeting of the Association for Computational Linguistics (ACL), 2023. 11 Jan 2023.

Does compressing activations help model parallel training?
S. Bian, Dacheng Li, Hongyi Wang, Eric P. Xing, Shivaram Venkataraman. Conference on Machine Learning and Systems (MLSys), 2023. 06 Jan 2023.

Cramming: Training a Language Model on a Single GPU in One Day
Jonas Geiping, Tom Goldstein. International Conference on Machine Learning (ICML), 2023. 28 Dec 2022.

Pretraining Without Attention
Junxiong Wang, J. Yan, Albert Gu, Alexander M. Rush. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. 20 Dec 2022.

ExtremeBERT: A Toolkit for Accelerating Pretraining of Customized BERT
Rui Pan, Shizhe Diao, Jianlin Chen, Tong Zhang. 30 Nov 2022.