How to Train BERT with an Academic Budget
Peter Izsak, Moshe Berchansky, Omer Levy
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
arXiv:2104.07705 (v2), 15 April 2021
Links: ArXiv (abs) · PDF · HTML · HuggingFace

Papers citing "How to Train BERT with an Academic Budget" (showing 50 of 71)

What is the Best Sequence Length for BABYLM?
Suchir Salhan, Richard Diehl Martinez, Zébulon Goriely, P. Buttery. 22 Oct 2025.

Patent Language Model Pretraining with ModernBERT
Amirhossein Yousefiramandi, Ciaran Cooney. 18 Sep 2025.

Stepsize anything: A unified learning rate schedule for budgeted-iteration training
Anda Tang, Yiming Dong, Yutao Zeng, Xun Zhou, Zhouchen Lin. 30 May 2025.

MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation
Maike Behrendt, Stefan Sylvius Wagner, Stefan Harmeling. 21 May 2025.

Prediction-powered estimators for finite population statistics in highly imbalanced textual data: Public hate crime estimation
Hannes Waldetoft, Jakob Torgander, Måns Magnusson. 05 May 2025.

Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Alex Warstadt, Aaron Mueller, Leshem Choshen, E. Wilcox, Chengxu Zhuang, …, Rafael Mosquera, Bhargavi Paranjape, Adina Williams, Tal Linzen, Robert Bamler. 10 Apr 2025.

Slamming: Training a Speech Language Model on One GPU in a Day
Gallil Maimon, Avishai Elmakies, Yossi Adi. Annual Meeting of the Association for Computational Linguistics (ACL), 2025. 19 Feb 2025.

A distributional simplicity bias in the learning dynamics of transformers
Riccardo Rende, Federica Gerace, Alessandro Laio, Sebastian Goldt. Neural Information Processing Systems (NeurIPS), 2024. 17 Feb 2025.

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, …, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli. 18 Dec 2024.

Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
Gavia Gray, Aman Tiwari, Shane Bergsma, Joel Hestness. Neural Information Processing Systems (NeurIPS), 2024. 01 Nov 2024.

$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources
Apoorv Khandelwal, Tian Yun, Nihal V. Nayak, Jack Merullo, Stephen H. Bach, Chen Sun, Ellie Pavlick. 30 Oct 2024.

Team Ryu's Submission to SIGMORPHON 2024 Shared Task on Subword Tokenization
Zilong Li. 19 Oct 2024.

Exploring the Benefit of Activation Sparsity in Pre-training
Zhengyan Zhang, Chaojun Xiao, Qiujieli Qin, Yankai Lin, Zhiyuan Zeng, Xu Han, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Jie Zhou. International Conference on Machine Learning (ICML), 2024. 04 Oct 2024.

Expanding Expressivity in Transformer Models with MöbiusAttention
Anna-Maria Halacheva, M. Nayyeri, Steffen Staab. 08 Sep 2024.

Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers
Sukjun Hwang, Aakash Lahoti, Tri Dao, Albert Gu. 13 Jul 2024.

Aligning Programming Language and Natural Language: Exploring Design Choices in Multi-Modal Transformer-Based Embedding for Bug Localization
Partha Chakraborty, Venkatraman Arumugam, M. Nagappan. 25 Jun 2024.

Knowledge Distillation vs. Pretraining from Scratch under a Fixed (Computation) Budget
Minh Duc Bui, Fabian David Schmidt, Goran Glavaš, Katharina von der Wense. 30 Apr 2024.

PeLLE: Encoder-based language models for Brazilian Portuguese based on open data
Guilherme Lamartine de Mello, Marcelo Finger, F. Serras, M. Carpi, Marcos Menon Jose, Pedro Henrique Domingues, Paulo Cavalim. 29 Feb 2024.

Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling
Mahdi Karami, Ali Ghodsi. 28 Feb 2024.

The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning
Nik Vaessen, David A. van Leeuwen. 21 Feb 2024.

The Compute Divide in Machine Learning: A Threat to Academic Contribution and Scrutiny?
T. Besiroglu, S. Bergerson, Amelia Michael, Lennart Heim, Xueyun Luo, Neil Thompson. 04 Jan 2024.

MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining
Jacob P. Portes, Alex Trott, Sam Havens, Daniel King, Abhinav Venigalla, Moin Nadeem, Nikhil Sardana, D. Khudia, Jonathan Frankle. Neural Information Processing Systems (NeurIPS), 2023. 29 Dec 2023.

Spike No More: Stabilizing the Pre-training of Large Language Models
Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki. 28 Dec 2023.

Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise
Rui Pan, Yuxing Liu, Xiaoyu Wang, Tong Zhang. 22 Dec 2023.

CLIMB: Curriculum Learning for Infant-inspired Model Building
Richard Diehl Martinez, Zébulon Goriely, Hope McGovern, Christopher Davis, Andrew Caines, P. Buttery, Lisa Beinborn. 15 Nov 2023.

Explicit Morphological Knowledge Improves Pre-training of Language Models for Hebrew
Eylon Gueta, Omer Goldman, Reut Tsarfaty. 01 Nov 2023.

A Quadratic Synchronization Rule for Distributed Deep Learning
Xinran Gu, Kaifeng Lyu, Sanjeev Arora, Jingzhao Zhang, Longbo Huang. International Conference on Learning Representations (ICLR), 2024. 22 Oct 2023.

A Simple and Robust Framework for Cross-Modality Medical Image Segmentation applied to Vision Transformers
Matteo Bastico, David Ryckelynck, Laurent Corté, Yannick Tillier, Etienne Decencière. 09 Oct 2023.

M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization
Che Liu, Sibo Cheng, Chong Chen, Mengyun Qiao, Weitong Zhang, Anand Shah, Wenjia Bai, Rossella Arcucci. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2023. 17 Jul 2023.

No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models
Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, Matt J. Kusner. Neural Information Processing Systems (NeurIPS), 2023. 12 Jul 2023.

Biomedical Language Models are Robust to Sub-optimal Tokenization
Bernal Jiménez Gutiérrez, Huan Sun, Yu Su. Workshop on Biomedical Natural Language Processing (BioNLP), 2023. 30 Jun 2023.

Surveying (Dis)Parities and Concerns of Compute Hungry NLP Research
Ji-Ung Lee, Haritz Puerto, Betty van Aken, Yuki Arase, Jessica Zosa Forde, …, Andreas Rucklé, Iryna Gurevych, Roy Schwartz, Emma Strubell, Jesse Dodge. 29 Jun 2023.

Lost in Translation: Large Language Models in Non-English Content Analysis
Gabriel Nicholas, Aliya Bhatia. 12 Jun 2023.

Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts
Ganesh Jawahar, Haichuan Yang, Yunyang Xiong, Zechun Liu, Dilin Wang, …, Barlas Oğuz, Muhammad Abdul-Mageed, L. Lakshmanan, Raghuraman Krishnamoorthi, Vikas Chandra. Annual Meeting of the Association for Computational Linguistics (ACL), 2023. 08 Jun 2023.

Data-Efficient French Language Modeling with CamemBERTa
Wissam Antoun, Benoît Sagot, Djamé Seddah. Annual Meeting of the Association for Computational Linguistics (ACL), 2023. 02 Jun 2023.

How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives
Xinpeng Wang, Leonie Weissweiler, Hinrich Schütze, Barbara Plank. Annual Meeting of the Association for Computational Linguistics (ACL), 2023. 24 May 2023.

Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, Tengyu Ma. International Conference on Learning Representations (ICLR), 2024. 23 May 2023.

Cuttlefish: Low-Rank Model Training without All the Tuning
Hongyi Wang, Saurabh Agarwal, Pongsakorn U-chupala, Yoshiki Tanaka, Eric P. Xing, Dimitris Papailiopoulos. Conference on Machine Learning and Systems (MLSys), 2023. 04 May 2023.

Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models
Junmo Kang, Wei Xu, Alan Ritter. Annual Meeting of the Association for Computational Linguistics (ACL), 2023. 02 May 2023.

The MiniPile Challenge for Data-Efficient Language Models
Jean Kaddour. 17 Apr 2023.

On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen, Yan Sun, Zhiyuan Yu, Liang Ding, Xinmei Tian, Dacheng Tao. 07 Apr 2023.

Do Transformers Parse while Predicting the Masked Word?
Haoyu Zhao, A. Panigrahi, Rong Ge, Sanjeev Arora. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. 14 Mar 2023.

The Framework Tax: Disparities Between Inference Efficiency in NLP Research and Deployment
Jared Fernandez, Jacob Kahn, Clara Na, Yonatan Bisk, Emma Strubell. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. 13 Feb 2023.

Data Selection for Language Models via Importance Resampling
Sang Michael Xie, Shibani Santurkar, Tengyu Ma, Percy Liang. Neural Information Processing Systems (NeurIPS), 2023. 06 Feb 2023.

Which Model Shall I Choose? Cost/Quality Trade-offs for Text Classification Tasks
Shi Zong, Joshua Seltzer, Jia Pan, Kathy Cheng, Jimmy J. Lin. 17 Jan 2023.

NarrowBERT: Accelerating Masked Language Model Pretraining and Inference
Haoxin Li, Phillip Keung, Daniel Cheng, Jungo Kasai, Noah A. Smith. Annual Meeting of the Association for Computational Linguistics (ACL), 2023. 11 Jan 2023.

Does compressing activations help model parallel training?
S. Bian, Dacheng Li, Hongyi Wang, Eric P. Xing, Shivaram Venkataraman. Conference on Machine Learning and Systems (MLSys), 2023. 06 Jan 2023.

Cramming: Training a Language Model on a Single GPU in One Day
Jonas Geiping, Tom Goldstein. International Conference on Machine Learning (ICML), 2023. 28 Dec 2022.

Pretraining Without Attention
Junxiong Wang, J. Yan, Albert Gu, Alexander M. Rush. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. 20 Dec 2022.

ExtremeBERT: A Toolkit for Accelerating Pretraining of Customized BERT
Rui Pan, Shizhe Diao, Jianlin Chen, Tong Zhang. 30 Nov 2022.