arXiv:2105.05241
Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus
11 May 2021
Jack Bandy
Nicholas Vincent
Papers citing "Addressing 'Documentation Debt' in Machine Learning Research: A Retrospective Datasheet for BookCorpus"
41 / 41 papers shown
Certified Mitigation of Worst-Case LLM Copyright Infringement
Jingyu Zhang
Jiacan Yu
Marc Marone
Benjamin Van Durme
Daniel Khashabi
MoMe
138
0
0
22 Apr 2025
Provocations from the Humanities for Generative AI Research
Lauren F. Klein
Meredith Martin
André Brock
Maria Antoniak
Melanie Walsh
Jessica Marie Johnson
Lauren Tilton
David M. Mimno
VLM
66
1
0
26 Feb 2025
Tiny Transformers Excel at Sentence Compression
Peter Belcak
Roger Wattenhofer
23
0
0
30 Oct 2024
Consent in Crisis: The Rapid Decline of the AI Data Commons
Shayne Longpre
Robert Mahari
Ariel N. Lee
Campbell Lund
Hamidah Oderinwale
...
Hanlin Li
Daphne Ippolito
Sara Hooker
Jad Kabbara
Sandy Pentland
69
35
0
20 Jul 2024
Position: Measure Dataset Diversity, Don't Just Claim It
Dora Zhao
Jerone T. A. Andrews
Orestis Papakyriakopoulos
Alice Xiang
64
14
0
11 Jul 2024
Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them?
Shayne Longpre
Robert Mahari
Naana Obeng-Marnu
William Brannon
Tobin South
Katy Gero
Sandy Pentland
Jad Kabbara
56
5
0
19 Apr 2024
Machine Unlearning in Large Language Models
Kongyang Chen
Zixin Wang
Bing Mi
Waixi Liu
Shaowei Wang
Xiaojun Ren
Jiaxing Shen
MU
24
11
0
03 Feb 2024
SoUnD Framework: Analyzing (So)cial Representation in (Un)structured (D)ata
Mark Díaz
Sunipa Dev
Emily Reif
Remi Denton
Vinodkumar Prabhakaran
33
3
0
28 Nov 2023
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
Shayne Longpre
Robert Mahari
Anthony Chen
Naana Obeng-Marnu
Damien Sileo
...
K. Bollacker
Tongshuang Wu
Luis Villa
Sandy Pentland
Sara Hooker
15
55
0
25 Oct 2023
RAI4IoE: Responsible AI for Enabling the Internet of Energy
Minhui Xue
Surya Nepal
Ling Liu
Subbu Sethuvenkatraman
Xingliang Yuan
Carsten Rudolph
Ruoxi Sun
Greg Eisenhauer
29
4
0
20 Sep 2023
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
Sneha Kudugunta
Isaac Caswell
Biao Zhang
Xavier Garcia
Christopher A. Choquette-Choo
...
Derrick Xin
Aditya Kusupati
Romi Stella
Ankur Bapna
Orhan Firat
59
118
0
09 Sep 2023
Towards Federated Foundation Models: Scalable Dataset Pipelines for Group-Structured Learning
Zachary B. Charles
Nicole Mitchell
Krishna Pillutla
Michael Reneer
Zachary Garrett
FedML
AI4CE
28
28
0
18 Jul 2023
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Shayne Longpre
Gregory Yauney
Emily Reif
Katherine Lee
Adam Roberts
...
Denny Zhou
Jason W. Wei
Kevin Robinson
David M. Mimno
Daphne Ippolito
21
147
0
22 May 2023
The MiniPile Challenge for Data-Efficient Language Models
Jean Kaddour
MoE
ALM
24
40
0
17 Apr 2023
Right the docs: Characterising voice dataset documentation practices used in machine learning
Kathy Reid
Elizabeth T. Williams
14
2
0
19 Mar 2023
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Hugo Laurençon
Lucile Saulnier
Thomas Wang
Christopher Akiki
Albert Villanova del Moral
...
Violette Lepercq
Suzana Ilić
Margaret Mitchell
Sasha Luccioni
Yacine Jernite
AI4CE
AILaw
36
163
0
07 Mar 2023
The ROOTS Search Tool: Data Transparency for LLMs
Aleksandra Piktus
Christopher Akiki
Paulo Villegas
Hugo Laurençon
Gérard Dupont
A. Luccioni
Yacine Jernite
Anna Rogers
VLM
28
29
0
27 Feb 2023
Trustworthy Social Bias Measurement
Rishi Bommasani
Percy Liang
27
10
0
20 Dec 2022
Noise-Robust De-Duplication at Scale
Emily Silcock
Luca D'Amico-Wong
Jinglin Yang
Melissa Dell
SyDa
31
20
0
09 Oct 2022
Every picture tells a story: Image-grounded controllable stylistic story generation
Holy Lovenia
Bryan Wilie
Romain Barraud
Samuel Cahyawijaya
Willy Chung
Pascale Fung
19
8
0
04 Sep 2022
All That's Happening behind the Scenes: Putting the Spotlight on Volunteer Moderator Labor in Reddit
Hanlin Li
Brent J. Hecht
Stevie Chancellor
19
38
0
28 May 2022
Empathetic Conversational Systems: A Review of Current Advances, Gaps, and Opportunities
Aravind Sesagiri Raamkumar
Yinping Yang
12
28
0
09 May 2022
Data Governance in the Age of Large-Scale Data-Driven Language Technology
Yacine Jernite
Huu Nguyen
Stella Biderman
A. Rogers
Maraim Masoud
...
Jörg Frohberg
Aaron Gokaslan
Peter Henderson
Rishi Bommasani
Margaret Mitchell
18
52
0
04 May 2022
Healthsheet: Development of a Transparency Artifact for Health Datasets
Negar Rostamzadeh
Diana Mincu
Subhrajit Roy
A. Smart
Lauren Wilcox
Mahima Pushkarna
Jessica Schrouff
Razvan Amironesei
Nyalleng Moorosi
Katherine A. Heller
37
62
0
26 Feb 2022
Can Machines Help Us Answering Question 16 in Datasheets, and In Turn Reflecting on Inappropriate Content?
P. Schramowski
Christopher Tauchmann
Kristian Kersting
FaML
14
86
0
14 Feb 2022
Yes-Yes-Yes: Proactive Data Collection for ACL Rolling Review and Beyond
Nils Dycke
Ilia Kuznetsov
Iryna Gurevych
18
10
0
27 Jan 2022
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
Angelina McMillan-Major
Zaid Alyafeai
Stella Biderman
Kimbo Chen
F. Toni
...
Aitor Soroa Etxabe
Pedro Ortiz Suarez
Zeerak Talat
Daniel Alexander van Strien
Yacine Jernite
32
14
0
25 Jan 2022
Datasheet for the Pile
Stella Biderman
Kieran Bicheno
Leo Gao
52
35
0
13 Jan 2022
Est-ce que vous compute? Code-switching, cultural identity, and AI
Arianna Falbo
Travis LaCroix
14
8
0
15 Dec 2021
Personalized Benchmarking with the Ludwig Benchmarking Toolkit
A. Narayan
Piero Molino
Karan Goel
W. Neiswanger
Christopher Ré
8
11
0
08 Nov 2021
Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey
Bonan Min
Hayley L Ross
Elior Sulem
Amir Pouran Ben Veyseh
Thien Huu Nguyen
Oscar Sainz
Eneko Agirre
Ilana Heinz
Dan Roth
LM&MA
VLM
AI4CE
71
1,029
0
01 Nov 2021
IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning
Pan Lu
Liang Qiu
Jiaqi Chen
Tony Xia
Yizhou Zhao
Wei Zhang
Zhou Yu
Xiaodan Liang
Song-Chun Zhu
AIMat
33
183
0
25 Oct 2021
A Framework for Deprecating Datasets: Standardizing Documentation, Identification, and Communication
A. Luccioni
Frances Corry
H. Sridharan
Mike Ananny
J. Schultz
Kate Crawford
46
29
0
18 Oct 2021
Sparks: Inspiration for Science Writing using Language Models
Katy Ilonka Gero
Vivian Liu
Lydia B. Chilton
LRM
101
171
0
14 Oct 2021
Frequency Effects on Syntactic Rule Learning in Transformers
Jason W. Wei
Dan Garrette
Tal Linzen
Ellie Pavlick
82
62
0
14 Sep 2021
'Just What do You Think You're Doing, Dave?' A Checklist for Responsible Data Use in NLP
Anna Rogers
Timothy Baldwin
Kobi Leins
104
64
0
14 Sep 2021
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
242
591
0
14 Jul 2021
GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures
Ivan Chelombiev
Daniel Justus
Douglas Orr
A. Dietrich
Frithjof Gressmann
A. Koliousis
Carlo Luschi
19
5
0
10 Jun 2021
Changing the World by Changing the Data
Anna Rogers
16
71
0
28 May 2021
Corpus-Level Evaluation for Event QA: The IndiaPoliceEvents Corpus Covering the 2002 Gujarat Violence
Andrew Halterman
Katherine A. Keith
Sheikh Muhammad Sarwar
Brendan T. O'Connor
16
27
0
27 May 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
253
1,986
0
31 Dec 2020