2505.17841
Cited By
TEDI: Trustworthy and Ethical Dataset Indicators to Analyze and Compare Dataset Documentation
23 May 2025
Wiebke Hutiri
Mircea Cimpoi
M. Scheuerman
Victoria Matthews
Alice Xiang
Papers citing "TEDI: Trustworthy and Ethical Dataset Indicators to Analyze and Compare Dataset Documentation"
36 / 36 papers shown
Title
Position: Measure Dataset Diversity, Don't Just Claim It
Dora Zhao
Jerone T. A. Andrews
Orestis Papakyriakopoulos
Alice Xiang
100
20
0
11 Jul 2024
A Standardized Machine-readable Dataset Documentation Format for Responsible AI
Nitisha Jain
Mubashara Akhtar
Joan Giner-Miguelez
Rajat Shinde
Joaquin Vanschoren
...
Costanza Conforti
Michael Kuchnik
Lora Aroyo
Omar Benjelloun
Elena Simperl
76
3
0
04 Jun 2024
YODAS: Youtube-Oriented Dataset for Audio and Speech
Xinjian Li
Shinnosuke Takamichi
Takaaki Saeki
William Chen
Sayaka Shiota
Shinji Watanabe
123
27
0
02 Jun 2024
Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them?
Shayne Longpre
Robert Mahari
Naana Obeng-Marnu
William Brannon
Tobin South
Katy Gero
Sandy Pentland
Jad Kabbara
91
8
0
19 Apr 2024
Croissant: A Metadata Format for ML-Ready Datasets
Mubashara Akhtar
Omar Benjelloun
Costanza Conforti
Pieter Gijsbers
Joan Giner-Miguelez
...
Slava Tykhonov
Joaquin Vanschoren
Jos van der Velde
Steffen Vogler
Carole-Jean Wu
80
39
0
28 Mar 2024
Fairness Feedback Loops: Training on Synthetic Data Amplifies Bias
Sierra Wyllie
Ilia Shumailov
Nicolas Papernot
85
32
0
12 Mar 2024
Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face
Xinyu Yang
Weixin Liang
James Zou
CVBM
82
20
0
24 Jan 2024
Open Datasheets: Machine-readable Documentation for Open Datasets and Responsible AI Assessments
Anthony C. Roman
Jennifer Wortman Vaughan
Valerie See
Steph Ballard
Jehu Torres Vega
Caleb Robinson
J. L. Ferres
52
5
0
11 Dec 2023
DMLR: Data-centric Machine Learning Research -- Past, Present and Future
Luis Oala
M. Maskey
Lilith Bat-Leah
Alicia Parrish
Nezihe Merve Gürel
...
Lora Aroyo
Ce Zhang
Joaquin Vanschoren
Isabelle Guyon
Peter Mattson
AI4CE
74
12
0
21 Nov 2023
What's In My Big Data?
Yanai Elazar
Akshita Bhagia
Ian H. Magnusson
Abhilasha Ravichander
Dustin Schwenk
...
Luca Soldaini
Sameer Singh
Hanna Hajishirzi
Noah A. Smith
Jesse Dodge
78
95
0
31 Oct 2023
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
Shayne Longpre
Robert Mahari
Anthony Chen
Naana Obeng-Marnu
Damien Sileo
...
K. Bollacker
Tongshuang Wu
Luis Villa
Sandy Pentland
Sara Hooker
95
64
0
25 Oct 2023
Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context
Wei Kang
Xiaoyu Yang
Zengwei Yao
Fangjun Kuang
Yifan Yang
Liyong Guo
Long Lin
Daniel Povey
83
56
0
15 Sep 2023
Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
Noa Garcia
Yusuke Hirota
Yankun Wu
Yuta Nakashima
EGVM
85
57
0
06 Apr 2023
Ethical Considerations for Responsible Data Curation
Jerone T. A. Andrews
Dora Zhao
William Thong
Apostolos Modas
Orestis Papakyriakopoulos
Alice Xiang
126
22
0
07 Feb 2023
Contrastive Language-Vision AI Models Pretrained on Web-Scraped Multimodal Data Exhibit Sexual Objectification Bias
Robert Wolfe
Yiwei Yang
Billy Howe
Aylin Caliskan
DiffM
117
57
0
21 Dec 2022
Measuring Data
Margaret Mitchell
A. Luccioni
Nathan Lambert
Marissa Gerchick
Angelina McMillan-Major
Ezinwanne Ozoani
Nazneen Rajani
Tristan Thrush
Yacine Jernite
Douwe Kiela
84
17
0
09 Dec 2022
Gender Artifacts in Visual Datasets
Nicole Meister
Dora Zhao
Angelina Wang
V. V. Ramaswamy
Ruth C. Fong
Olga Russakovsky
70
29
0
18 Jun 2022
CrowdWorkSheets: Accounting for Individual and Collective Identities Underlying Crowdsourced Dataset Annotation
Mark Díaz
Ian D Kivlichan
Rachel Rosen
Dylan K. Baker
Razvan Amironesei
Vinodkumar Prabhakaran
Emily L. Denton
63
85
0
09 Jun 2022
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
Alexis Conneau
Min Ma
Simran Khanuja
Yu Zhang
Vera Axelrod
Siddharth Dalmia
Jason Riesa
Clara E. Rivera
Ankur Bapna
VLM
153
331
0
25 May 2022
Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI
Mahima Pushkarna
Andrew Zaldivar
Oddur Kjartansson
AI4TS
106
221
0
03 Apr 2022
Representation Bias in Data: A Survey on Identification and Resolution Techniques
N. Shahbazi
Yin Lin
Abolfazl Asudeh
H. V. Jagadish
84
76
0
22 Mar 2022
The Dataset Nutrition Label (2nd Gen): Leveraging Context to Mitigate Harms in Artificial Intelligence
Kasia Chmielinski
S. Newman
Matt Taylor
Joshua Joseph
Kemi Thomas
Jessica Yurkofsky
Yue Qiu
76
53
0
10 Jan 2022
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman
Andrew Westbury
Eugene Byrne
Zachary Chavis
Antonino Furnari
...
Mike Zheng Shou
Antonio Torralba
Lorenzo Torresani
Mingfei Yan
Jitendra Malik
EgoV
418
1,114
0
13 Oct 2021
GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio
Guoguo Chen
Shuzhou Chai
Guan-Bo Wang
Jiayu Du
Weiqiang Zhang
...
Xuchen Yao
Yongqing Wang
Yujun Wang
Zhao You
Zhiyong Yan
123
385
0
13 Jun 2021
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
AILaw
124
452
0
18 Apr 2021
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
Changhan Wang
M. Rivière
Ann Lee
Anne Wu
Chaitanya Talnikar
Daniel Haziza
Mary Williamson
J. Pino
Emmanuel Dupoux
SSL
113
496
0
02 Jan 2021
MLS: A Large-Scale Multilingual Dataset for Speech Research
Vineel Pratap
Qiantong Xu
Anuroop Sriram
Gabriel Synnaeve
R. Collobert
AuLLM
136
512
0
07 Dec 2020
Large image datasets: A pyrrhic win for computer vision?
Vinay Uday Prabhu
Abeba Birhane
90
367
0
24 Jun 2020
Mitigating Gender Bias in Captioning Systems
Ruixiang Tang
Mengnan Du
Yuening Li
Zirui Liu
Na Zou
Helen Zhou
FaML
94
66
0
15 Jun 2020
Measuring Social Biases in Grounded Vision and Language Embeddings
Candace Ross
Boris Katz
Andrei Barbu
95
65
0
20 Feb 2020
Libri-Light: A Benchmark for ASR with Limited or No Supervision
Jacob Kahn
M. Rivière
Weiyi Zheng
Evgeny Kharitonov
Qiantong Xu
...
Tatiana Likhomanenko
Gabriel Synnaeve
Armand Joulin
Abdel-rahman Mohamed
Emmanuel Dupoux
AuLLM
82
674
0
17 Dec 2019
Common Voice: A Massively-Multilingual Speech Corpus
Rosana Ardila
Megan Branson
Kelly Davis
Michael Henretty
M. Kohler
Josh Meyer
Reuben Morais
Lindsay Saunders
Francis M. Tyers
Gregor Weber
VLM
102
1,622
0
13 Dec 2019
MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations
Soujanya Poria
Devamanyu Hazarika
Navonil Majumder
Gautam Naik
Min Zhang
Rada Mihalcea
123
1,082
0
05 Oct 2018
VoxCeleb2: Deep Speaker Recognition
Joon Son Chung
Arsha Nagrani
Andrew Zisserman
362
2,289
0
14 Jun 2018
Datasheets for Datasets
Timnit Gebru
Jamie Morgenstern
Briana Vecchione
Jennifer Wortman Vaughan
Hanna M. Wallach
Hal Daumé
Kate Crawford
302
2,201
0
23 Mar 2018
VoxCeleb: a large-scale speaker identification dataset
Arsha Nagrani
Joon Son Chung
Andrew Zisserman
131
2,287
0
26 Jun 2017