TEDI: Trustworthy and Ethical Dataset Indicators to Analyze and Compare Dataset Documentation

23 May 2025
Wiebke Hutiri
Mircea Cimpoi
M. Scheuerman
Victoria Matthews
Alice Xiang

Papers citing "TEDI: Trustworthy and Ethical Dataset Indicators to Analyze and Compare Dataset Documentation"

36 / 36 papers shown
Position: Measure Dataset Diversity, Don't Just Claim It
Dora Zhao
Jerone T. A. Andrews
Orestis Papakyriakopoulos
Alice Xiang
283
31
0
11 Jul 2024
A Standardized Machine-readable Dataset Documentation Format for Responsible AI
Nitisha Jain
Mubashara Akhtar
Joan Giner-Miguelez
Rajat Shinde
Joaquin Vanschoren
...
Costanza Conforti
Michael Kuchnik
Lora Aroyo
Omar Benjelloun
Elena Simperl
220
6
0
04 Jun 2024
YODAS: Youtube-Oriented Dataset for Audio and Speech
Xinjian Li
Shinnosuke Takamichi
Takaaki Saeki
William Chen
Sayaka Shiota
Shinji Watanabe
363
53
0
02 Jun 2024
Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them?
Shayne Longpre
Robert Mahari
Naana Obeng-Marnu
William Brannon
Tobin South
Katy Gero
Sandy Pentland
Jad Kabbara
298
22
0
19 Apr 2024
Croissant: A Metadata Format for ML-Ready Datasets
Mubashara Akhtar
Omar Benjelloun
Costanza Conforti
Pieter Gijsbers
Joan Giner-Miguelez
...
Slava Tykhonov
Joaquin Vanschoren
Jos van der Velde
Steffen Vogler
Carole-Jean Wu
333
66
0
28 Mar 2024
Fairness Feedback Loops: Training on Synthetic Data Amplifies Bias
Conference on Fairness, Accountability and Transparency (FAccT), 2024
Sierra Wyllie
Ilia Shumailov
Nicolas Papernot
242
52
0
12 Mar 2024
Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face
International Conference on Learning Representations (ICLR), 2024
Xinyu Yang
Weixin Liang
James Zou
CVBM
223
36
0
24 Jan 2024
Open Datasheets: Machine-readable Documentation for Open Datasets and Responsible AI Assessments
Anthony C. Roman
Jennifer Wortman Vaughan
Valerie See
Steph Ballard
Jehu Torres Vega
Caleb Robinson
J. L. Ferres
239
11
0
11 Dec 2023
DMLR: Data-centric Machine Learning Research -- Past, Present and Future
Luis Oala
M. Maskey
Lilith Bat-Leah
Alicia Parrish
Nezihe Merve Gürel
...
Lora Aroyo
Ce Zhang
Joaquin Vanschoren
Isabelle Guyon
Peter Mattson
AI4CE
273
17
0
21 Nov 2023
What's In My Big Data?
Yanai Elazar
Akshita Bhagia
Ian H. Magnusson
Abhilasha Ravichander
Dustin Schwenk
...
Luca Soldaini
Sameer Singh
Hanna Hajishirzi
Noah A. Smith
Jesse Dodge
242
123
0
31 Oct 2023
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
Shayne Longpre
Robert Mahari
Anthony Chen
Naana Obeng-Marnu
Damien Sileo
...
K. Bollacker
Tongshuang Wu
Luis Villa
Sandy Pentland
Sara Hooker
336
88
0
25 Oct 2023
Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Wei Kang
Xiaoyu Yang
Zengwei Yao
Fangjun Kuang
Yifan Yang
Liyong Guo
Long Lin
Daniel Povey
265
115
0
15 Sep 2023
Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
Computer Vision and Pattern Recognition (CVPR), 2023
Noa Garcia
Yusuke Hirota
Yankun Wu
Yuta Nakashima
EGVM
202
71
0
06 Apr 2023
Ethical Considerations for Responsible Data Curation
Neural Information Processing Systems (NeurIPS), 2023
Jerone T. A. Andrews
Dora Zhao
William Thong
Apostolos Modas
Orestis Papakyriakopoulos
Alice Xiang
433
31
0
07 Feb 2023
Contrastive Language-Vision AI Models Pretrained on Web-Scraped Multimodal Data Exhibit Sexual Objectification Bias
Conference on Fairness, Accountability and Transparency (FAccT), 2022
Robert Wolfe
Yiwei Yang
Billy Howe
Aylin Caliskan
DiffM
337
73
0
21 Dec 2022
Measuring Data
Margaret Mitchell
A. Luccioni
Nathan Lambert
Marissa Gerchick
Angelina McMillan-Major
Ezinwanne Ozoani
Nazneen Rajani
Tristan Thrush
Yacine Jernite
Douwe Kiela
242
19
0
09 Dec 2022
Gender Artifacts in Visual Datasets
IEEE International Conference on Computer Vision (ICCV), 2022
Nicole Meister
Dora Zhao
Angelina Wang
V. V. Ramaswamy
Ruth C. Fong
Olga Russakovsky
301
36
0
18 Jun 2022
CrowdWorkSheets: Accounting for Individual and Collective Identities Underlying Crowdsourced Dataset Annotation
Conference on Fairness, Accountability and Transparency (FAccT), 2022
Mark Díaz
Ian D Kivlichan
Rachel Rosen
Dylan K. Baker
Razvan Amironesei
Vinodkumar Prabhakaran
Emily L. Denton
228
99
0
09 Jun 2022
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
Spoken Language Technology Workshop (SLT), 2022
Alexis Conneau
Min Ma
Simran Khanuja
Yu Zhang
Vera Axelrod
Siddharth Dalmia
Jason Riesa
Clara E. Rivera
Ankur Bapna
VLM
506
488
0
25 May 2022
Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI
Conference on Fairness, Accountability and Transparency (FAccT), 2022
Mahima Pushkarna
Andrew Zaldivar
Oddur Kjartansson
AI4TS
251
270
0
03 Apr 2022
Representation Bias in Data: A Survey on Identification and Resolution Techniques
ACM Computing Surveys (ACM CSUR), 2022
N. Shahbazi
Yin Lin
Abolfazl Asudeh
H. V. Jagadish
294
111
0
22 Mar 2022
The Dataset Nutrition Label (2nd Gen): Leveraging Context to Mitigate Harms in Artificial Intelligence
Kasia Chmielinski
S. Newman
Matt Taylor
Joshua Joseph
Kemi Thomas
Jessica Yurkofsky
Yue Qiu
220
68
0
10 Jan 2022
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman
Andrew Westbury
Eugene Byrne
Zachary Chavis
Antonino Furnari
...
Mike Zheng Shou
Antonio Torralba
Lorenzo Torresani
Mingfei Yan
Jitendra Malik
EgoV
1.0K
1,486
0
13 Oct 2021
GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio
Interspeech (Interspeech), 2021
Guoguo Chen
Shuzhou Chai
Guan-Bo Wang
Jiayu Du
Weiqiang Zhang
...
Xuchen Yao
Yongqing Wang
Yujun Wang
Zhao You
Zhiyong Yan
386
508
0
13 Jun 2021
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
AILaw
322
582
0
18 Apr 2021
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Changhan Wang
M. Rivière
Ann Lee
Anne Wu
Chaitanya Talnikar
Daniel Haziza
Mary Williamson
J. Pino
Emmanuel Dupoux
SSL
617
631
0
02 Jan 2021
MLS: A Large-Scale Multilingual Dataset for Speech Research
Interspeech (Interspeech), 2020
Vineel Pratap
Qiantong Xu
Anuroop Sriram
Gabriel Synnaeve
R. Collobert
AuLLM
679
681
0
07 Dec 2020
Large image datasets: A pyrrhic win for computer vision?
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2020
Vinay Uday Prabhu
Abeba Birhane
335
406
0
24 Jun 2020
Mitigating Gender Bias in Captioning Systems
Ruixiang Tang
Mengnan Du
Yuening Li
Zirui Liu
Na Zou
Helen Zhou
FaML
550
74
0
15 Jun 2020
Measuring Social Biases in Grounded Vision and Language Embeddings
North American Chapter of the Association for Computational Linguistics (NAACL), 2020
Candace Ross
Boris Katz
Andrei Barbu
309
69
0
20 Feb 2020
Libri-Light: A Benchmark for ASR with Limited or No Supervision
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019
Jacob Kahn
M. Rivière
Weiyi Zheng
Evgeny Kharitonov
Qiantong Xu
...
Tatiana Likhomanenko
Gabriel Synnaeve
Armand Joulin
Abdel-rahman Mohamed
Emmanuel Dupoux
AuLLM
417
771
0
17 Dec 2019
Common Voice: A Massively-Multilingual Speech Corpus
International Conference on Language Resources and Evaluation (LREC), 2019
Rosana Ardila
Megan Branson
Kelly Davis
Michael Henretty
M. Kohler
Josh Meyer
Reuben Morais
Lindsay Saunders
Francis M. Tyers
Gregor Weber
VLM
379
2,112
0
13 Dec 2019
MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations
Soujanya Poria
Devamanyu Hazarika
Navonil Majumder
Gautam Naik
Xiaoshi Zhong
Amélie Reymond
666
1,402
0
05 Oct 2018
VoxCeleb2: Deep Speaker Recognition
Joon Son Chung
Arsha Nagrani
Andrew Zisserman
704
2,591
0
14 Jun 2018
Datasheets for Datasets
Timnit Gebru
Jamie Morgenstern
Briana Vecchione
Jennifer Wortman Vaughan
Hanna M. Wallach
Hal Daumé
Kate Crawford
1.2K
2,612
0
23 Mar 2018
VoxCeleb: a large-scale speaker identification dataset
Arsha Nagrani
Joon Son Chung
Andrew Zisserman
1.5K
2,567
0
26 Jun 2017