TEDI: Trustworthy and Ethical Dataset Indicators to Analyze and Compare Dataset Documentation

23 May 2025

Papers citing "TEDI: Trustworthy and Ethical Dataset Indicators to Analyze and Compare Dataset Documentation"

36 / 36 papers shown

Position: Measure Dataset Diversity, Don't Just Claim It

Dora Zhao

Jerone T. A. Andrews

Orestis Papakyriakopoulos

Alice Xiang

283

11 Jul 2024

A Standardized Machine-readable Dataset Documentation Format for Responsible AI

...

220

04 Jun 2024

YODAS: Youtube-Oriented Dataset for Audio and Speech

Shinji Watanabe

366

02 Jun 2024

Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them?

298

19 Apr 2024

Croissant: A Metadata Format for ML-Ready Datasets

...

333

28 Mar 2024

Fairness Feedback Loops: Training on Synthetic Data Amplifies Bias

Conference on Fairness, Accountability and Transparency (FAccT), 2024

Sierra Wyllie

Ilia Shumailov

Nicolas Papernot

245

12 Mar 2024

Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face

International Conference on Learning Representations (ICLR), 2024

223

24 Jan 2024

Open Datasheets: Machine-readable Documentation for Open Datasets and Responsible AI Assessments

Anthony C. Roman

Jennifer Wortman Vaughan

239

11 Dec 2023

DMLR: Data-centric Machine Learning Research -- Past, Present and Future

Nezihe Merve Gürel

...

Lora Aroyo

273

21 Nov 2023

What's In My Big Data?

Yanai Elazar

Akshita Bhagia

Ian H. Magnusson

Abhilasha Ravichander

Dustin Schwenk

...

Luca Soldaini

242

124

31 Oct 2023

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

Damien Sileo

...

Tongshuang Wu

336

25 Oct 2023

Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023

Wei Kang

Xiaoyu Yang

Zengwei Yao

Fangjun Kuang

Yifan Yang

Liyong Guo

Long Lin

Daniel Povey

268

115

15 Sep 2023

Uncurated Image-Text Datasets: Shedding Light on Demographic Bias

Computer Vision and Pattern Recognition (CVPR), 2023

202

06 Apr 2023

Ethical Considerations for Responsible Data Curation

Neural Information Processing Systems (NeurIPS), 2023

Orestis Papakyriakopoulos

Alice Xiang

434

07 Feb 2023

Contrastive Language-Vision AI Models Pretrained on Web-Scraped Multimodal Data Exhibit Sexual Objectification Bias

Conference on Fairness, Accountability and Transparency (FAccT), 2022

337

21 Dec 2022

Angelina McMillan-Major

Douwe Kiela

242

09 Dec 2022

Gender Artifacts in Visual Datasets

IEEE International Conference on Computer Vision (ICCV), 2022

303

18 Jun 2022

CrowdWorkSheets: Accounting for Individual and Collective Identities Underlying Crowdsourced Dataset Annotation

Conference on Fairness, Accountability and Transparency (FAccT), 2022

Vinodkumar Prabhakaran

Emily L. Denton

231

09 Jun 2022

FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

Spoken Language Technology Workshop (SLT), 2022

506

493

25 May 2022

Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI

Conference on Fairness, Accountability and Transparency (FAccT), 2022

252

273

03 Apr 2022

Representation Bias in Data: A Survey on Identification and Resolution Techniques

ACM Computing Surveys (ACM CSUR), 2022

294

112

22 Mar 2022

The Dataset Nutrition Label (2nd Gen): Leveraging Context to Mitigate Harms in Artificial Intelligence

223

10 Jan 2022

Ego4D: Around the World in 3,000 Hours of Egocentric Video

...

Antonio Torralba

Mingfei Yan

1.0K

1,503

13 Oct 2021

GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

Interspeech (Interspeech), 2021

...

Yujun Wang

386

512

13 Jun 2021

Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

Dirk Groeneveld

326

582

18 Apr 2021

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Annual Meeting of the Association for Computational Linguistics (ACL), 2021

619

634

02 Jan 2021

MLS: A Large-Scale Multilingual Dataset for Speech Research

Interspeech (Interspeech), 2020

679

689

07 Dec 2020

Large image datasets: A pyrrhic win for computer vision?

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2020

Vinay Uday Prabhu

Abeba Birhane

338

408

24 Jun 2020

Mitigating Gender Bias in Captioning Systems

551

15 Jun 2020

Measuring Social Biases in Grounded Vision and Language Embeddings

North American Chapter of the Association for Computational Linguistics (NAACL), 2020

Candace Ross

Boris Katz

Andrei Barbu

312

20 Feb 2020

Libri-Light: A Benchmark for ASR with Limited or No Supervision

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019

...

425

772

17 Dec 2019

Common Voice: A Massively-Multilingual Speech Corpus

International Conference on Language Resources and Evaluation (LREC), 2019

388

2,121

13 Dec 2019

MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations

669

1,413

05 Oct 2018

VoxCeleb2: Deep Speaker Recognition

Joon Son Chung

Arsha Nagrani

Andrew Zisserman

704

2,596

14 Jun 2018

Datasheets for Datasets

Timnit Gebru

Jamie Morgenstern

Briana Vecchione

Jennifer Wortman Vaughan

Hanna M. Wallach

Hal Daumé

Kate Crawford

1.2K

2,612

23 Mar 2018

VoxCeleb: a large-scale speaker identification dataset

Arsha Nagrani

Joon Son Chung

Andrew Zisserman

1.5K

2,571

26 Jun 2017