Automatic Document Selection for Efficient Encoder Pretraining

20 October 2022

Papers citing "Automatic Document Selection for Efficient Encoder Pretraining"


Title | Authors | Tags | Counts | Date
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models | Zachary Ankner, Cody Blakeney, Kartik K. Sreenivasan, Max Marion, Matthew L. Leavitt, Mansheej Paul | | 16 23 0 | 30 May 2024
Generative Deduplication For Socia Media Data Selection | Xianming Li, Jing Li | | 16 2 0 | 11 Jan 2024
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining | Sang Michael Xie, Hieu H. Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, Adams Wei Yu | MoMe, MoE | 6 169 0 | 17 May 2023
Pre-train or Annotate? Domain Adaptation with a Constrained Budget | Fan Bai, Alan Ritter, Wei-ping Xu | | 52 28 0 | 10 Sep 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling | Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy | AIMat | 236 1,508 0 | 31 Dec 2020
Code and Named Entity Recognition in StackOverflow | Jeniya Tabassum, Mounica Maddela, Wei-ping Xu, Alan Ritter | | 52 114 0 | 04 May 2020
Scaling Laws for Neural Language Models | Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei | | 220 3,054 0 | 23 Jan 2020