Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2401.16380
Cited By
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
29 January 2024
Pratyush Maini
Skyler Seto
Richard He Bai
David Grangier
Yizhe Zhang
Navdeep Jaitly
SyDa
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling"
15 / 15 papers shown
Title
Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models
Xuhui Jiang
Shengjie Ma
Chengjin Xu
Cehao Yang
Liyu Zhang
Jian Guo
SyDa
20
0
0
02 May 2025
On the generalization of language models from in-context learning and finetuning: a controlled study
Andrew Kyle Lampinen
Arslan Chaudhry
Stephanie Chan
Cody Wild
Diane Wan
Alex Ku
Jorg Bornschein
Razvan Pascanu
Murray Shanahan
James L. McClelland
46
0
0
01 May 2025
Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
Fan Zhou
Zengzhi Wang
Qian Liu
Junlong Li
Pengfei Liu
ALM
88
14
0
17 Feb 2025
Training Bilingual LMs with Data Constraints in the Targeted Language
Skyler Seto
Maartje ter Hoeve
He Bai
Natalie Schluter
David Grangier
71
0
0
20 Nov 2024
Latent Paraphrasing: Perturbation on Layers Improves Knowledge Injection in Language Models
Minki Kang
Sung Ju Hwang
Gibbeum Lee
Jaewoong Cho
KELM
30
0
0
01 Nov 2024
Self-calibration for Language Model Quantization and Pruning
Miles Williams
G. Chrysostomou
Nikolaos Aletras
MQ
36
0
0
22 Oct 2024
MIND: Math Informed syNthetic Dialogues for Pretraining LLMs
Syeda Nahida Akter
Shrimai Prabhumoye
John Kamalu
S. Satheesh
Eric Nyberg
M. Patwary
M. Shoeybi
Bryan Catanzaro
LRM
SyDa
ReLM
81
1
0
15 Oct 2024
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling
David Grangier
Simin Fan
Skyler Seto
Pierre Ablin
34
3
0
30 Sep 2024
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Tao Ge
Xin Chan
Dian Yu
Haitao Mi
Dong Yu
Dong Yu
SyDa
100
89
0
28 Jun 2024
CinePile: A Long Video Question Answering Dataset and Benchmark
Ruchit Rawal
Khalid Saifullah
Ronen Basri
David Jacobs
Gowthami Somepalli
Tom Goldstein
29
39
0
14 May 2024
To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
Fuzhao Xue
Yao Fu
Wangchunshu Zhou
Zangwei Zheng
Yang You
79
74
0
22 May 2023
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
Suchin Gururangan
Dallas Card
Sarah K. Drier
E. K. Gade
Leroy Z. Wang
Zeyu Wang
Luke Zettlemoyer
Noah A. Smith
157
72
0
25 Jan 2022
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
236
1,508
0
31 Dec 2020
PubMedQA: A Dataset for Biomedical Research Question Answering
Qiao Jin
Bhuwan Dhingra
Zhengping Liu
William W. Cohen
Xinghua Lu
196
791
0
13 Sep 2019
1