2210.10951
Automatic Document Selection for Efficient Encoder Pretraining
20 October 2022
Yukun Feng
Patrick Xia
Benjamin Van Durme
João Sedoc
Papers citing "Automatic Document Selection for Efficient Encoder Pretraining"
7 / 7 papers shown
Title
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Zachary Ankner
Cody Blakeney
Kartik K. Sreenivasan
Max Marion
Matthew L. Leavitt
Mansheej Paul
16
23
0
30 May 2024
Generative Deduplication For Social Media Data Selection
Xianming Li
Jing Li
16
2
0
11 Jan 2024
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Sang Michael Xie
Hieu H. Pham
Xuanyi Dong
Nan Du
Hanxiao Liu
Yifeng Lu
Percy Liang
Quoc V. Le
Tengyu Ma
Adams Wei Yu
MoMe
MoE
6
169
0
17 May 2023
Pre-train or Annotate? Domain Adaptation with a Constrained Budget
Fan Bai
Alan Ritter
Wei-ping Xu
52
28
0
10 Sep 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
236
1,508
0
31 Dec 2020
Code and Named Entity Recognition in StackOverflow
Jeniya Tabassum
Mounica Maddela
Wei-ping Xu
Alan Ritter
52
114
0
04 May 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
220
3,054
0
23 Jan 2020