2201.07311
Cited By
Datasheet for the Pile
13 January 2022
Stella Biderman
Kieran Bicheno
Leo Gao
Papers citing
"Datasheet for the Pile"
34 / 34 papers shown
Title
The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation
Olivier Gouvert
Julie Hunter
Jérôme Louradour
Christophe Cerisara
Evan Dufraisse
Yaya Sy
Laura Rivière
Jean-Pierre Lorré
OpenLLM-France community
102
0
0
15 Mar 2025
Benchmarking the Performance of Large Language Models on the Cerebras Wafer Scale Engine
Zuoning Zhang
Dhruv Parikh
Youning Zhang
Viktor Prasanna
21
1
0
30 Aug 2024
Consent in Crisis: The Rapid Decline of the AI Data Commons
Shayne Longpre
Robert Mahari
Ariel N. Lee
Campbell Lund
Hamidah Oderinwale
...
Hanlin Li
Daphne Ippolito
Sara Hooker
Jad Kabbara
Sandy Pentland
56
35
0
20 Jul 2024
Block Transformer: Global-to-Local Language Modeling for Fast Inference
Namgyu Ho
Sangmin Bae
Taehyeon Kim
Hyunjik Jo
Yireun Kim
Tal Schuster
Adam Fisch
James Thorne
Se-Young Yun
45
7
0
04 Jun 2024
LMD3: Language Model Data Density Dependence
John Kirchenbauer
Garrett Honke
Gowthami Somepalli
Jonas Geiping
Daphne Ippolito
Katherine Lee
Tom Goldstein
David Andre
30
6
0
10 May 2024
Introducing L2M3, A Multilingual Medical Large Language Model to Advance Health Equity in Low-Resource Regions
Agasthya Gangavarapu
LM&MA
13
6
0
11 Apr 2024
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Shengding Hu
Yuge Tu
Xu Han
Chaoqun He
Ganqu Cui
...
Chaochao Jia
Guoyang Zeng
Dahai Li
Zhiyuan Liu
Maosong Sun
MoE
38
282
0
09 Apr 2024
Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training
Yanlai Yang
Matt Jones
Michael C. Mozer
Mengye Ren
51
1
0
14 Mar 2024
DYAD: A Descriptive Yet Abjuring Density efficient approximation to linear neural network layers
S. Chandy
Varun Gangal
Yi Yang
Gabriel Maggiotti
30
0
0
11 Dec 2023
HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs
Junying Chen
Xidong Wang
Anningzhe Gao
Feng Jiang
Shunian Chen
...
Chuyi Kong
Jianquan Li
Xiang Wan
Haizhou Li
Benyou Wang
LM&MA
24
61
0
16 Nov 2023
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
Shayne Longpre
Robert Mahari
Anthony Chen
Naana Obeng-Marnu
Damien Sileo
...
K. Bollacker
Tongshuang Wu
Luis Villa
Sandy Pentland
Sara Hooker
10
55
0
25 Oct 2023
Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior
Ashmit Khandelwal
Aditya Agrawal
Aanisha Bhattacharyya
Yaman Kumar Singla
Somesh Singh
...
Ishita Dasgupta
Stefano Petrangeli
R. Shah
Changyou Chen
Balaji Krishnamurthy
11
8
0
01 Sep 2023
RecycleGPT: An Autoregressive Language Model with Recyclable Module
Yu Jiang
Qiaozhi He
Xiaomin Zhuang
Zhihua Wu
Kunpeng Wang
Wenlai Zhao
Guangwen Yang
KELM
18
3
0
07 Aug 2023
ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
Guanhua Wang
Heyang Qin
S. A. Jacobs
Connor Holmes
Samyam Rajbhandari
Olatunji Ruwase
Feng Yan
Lei Yang
Yuxiong He
VLM
53
56
0
16 Jun 2023
GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration
Aleksandra Piktus
Odunayo Ogundepo
Christopher Akiki
Akintunde Oladipo
Xinyu Crystina Zhang
Hailey Schoelkopf
Stella Biderman
Martin Potthast
Jimmy J. Lin
CVBM
21
10
0
02 Jun 2023
Large Language Models, scientific knowledge and factuality: A systematic analysis in antibiotic discovery
Magdalena Wysocka
Oskar Wysocki
Maxime Delmas
V. Mutel
André Freitas
LM&MA
25
6
0
28 May 2023
A Language Model of Java Methods with Train/Test Deduplication
Chia-Yi Su
Aakash Bansal
Vijayanta Jain
S. Ghanavati
Collin McMillan
SyDa
VLM
16
9
0
15 May 2023
Emergent and Predictable Memorization in Large Language Models
Stella Biderman
USVSN Sai Prashanth
Lintang Sutawika
Hailey Schoelkopf
Quentin G. Anthony
Shivanshu Purohit
Edward Raff
19
117
0
21 Apr 2023
Should ChatGPT be Biased? Challenges and Risks of Bias in Large Language Models
Emilio Ferrara
SILM
12
247
0
07 Apr 2023
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Stella Biderman
Hailey Schoelkopf
Quentin G. Anthony
Herbie Bradley
Kyle O'Brien
...
USVSN Sai Prashanth
Edward Raff
Aviya Skowron
Lintang Sutawika
Oskar van der Wal
25
1,164
0
03 Apr 2023
Foundation Models and Fair Use
Peter Henderson
Xuechen Li
Dan Jurafsky
Tatsunori Hashimoto
Mark A. Lemley
Percy Liang
12
119
0
28 Mar 2023
Eliciting Latent Predictions from Transformers with the Tuned Lens
Nora Belrose
Zach Furman
Logan Smith
Danny Halawi
Igor V. Ostrovsky
Lev McKinney
Stella Biderman
Jacob Steinhardt
11
192
0
14 Mar 2023
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Hugo Laurençon
Lucile Saulnier
Thomas Wang
Christopher Akiki
Albert Villanova del Moral
...
Violette Lepercq
Suzana Ilić
Margaret Mitchell
Sasha Luccioni
Yacine Jernite
AI4CE
AILaw
31
163
0
07 Mar 2023
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop:
Teven Le Scao
Angela Fan
Christopher Akiki
...
Zhongli Xie
Zifan Ye
M. Bras
Younes Belkada
Thomas Wolf
VLM
65
2,301
0
09 Nov 2022
Data Governance in the Age of Large-Scale Data-Driven Language Technology
Yacine Jernite
Huu Nguyen
Stella Biderman
A. Rogers
Maraim Masoud
...
Jörg Frohberg
Aaron Gokaslan
Peter Henderson
Rishi Bommasani
Margaret Mitchell
11
52
0
04 May 2022
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Sid Black
Stella Biderman
Eric Hallahan
Quentin G. Anthony
Leo Gao
...
Shivanshu Purohit
Laria Reynolds
J. Tow
Benqi Wang
Samuel Weinbach
18
797
0
14 Apr 2022
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
Angelina McMillan-Major
Zaid Alyafeai
Stella Biderman
Kimbo Chen
F. Toni
...
Aitor Soroa Etxabe
Pedro Ortiz Suarez
Zeerak Talat
Daniel Alexander van Strien
Yacine Jernite
17
14
0
25 Jan 2022
Fast Model Editing at Scale
E. Mitchell
Charles Lin
Antoine Bosselut
Chelsea Finn
Christopher D. Manning
KELM
219
341
0
21 Oct 2021
Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration
Shufan Wang
Laure Thompson
Mohit Iyyer
169
66
0
13 Sep 2021
Intersectional Bias in Causal Language Models
Liam Magee
Lida Ghahremanlou
K. Soldatić
S. Robertson
189
31
0
16 Jul 2021
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
237
588
0
14 Jul 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
245
1,986
0
31 Dec 2020
Big Bird: Transformers for Longer Sequences
Manzil Zaheer
Guru Guruganesh
Kumar Avinava Dubey
Joshua Ainslie
Chris Alberti
...
Philip Pham
Anirudh Ravula
Qifan Wang
Li Yang
Amr Ahmed
VLM
249
2,009
0
28 Jul 2020
Distill, Adapt, Distill: Training Small, In-Domain Models for Neural Machine Translation
Mitchell A. Gordon
Kevin Duh
CLL
VLM
19
13
0
05 Mar 2020