Data and its (dis)contents: A survey of dataset development and use in machine learning research

9 December 2020

Amandalynne Paullada

Inioluwa Deborah Raji

Papers citing "Data and its (dis)contents: A survey of dataset development and use in machine learning research"

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs

Igor Shilov

Alex Cloud

Aryo Pradipta Gema

Jacob Goldman-Wetzler

05 Dec 2025

Culture Affordance Atlas: Reconciling Object Diversity Through Functional Mapping

117

02 Dec 2025

Synthetic Data: AI's New Weapon Against Android Malware

Angelo Gaspar Diniz Nogueira

180

24 Nov 2025

Bias in, Bias out: Annotation Bias in Multilingual Large Language Models

Xia Cui

Ziyi Huang

Naeemeh Adel

109

18 Nov 2025

AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent

234

07 Nov 2025

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

...

231

28 Oct 2025

The Benchmarking Epistemology: Construct Validity for Evaluating Machine Learning Models

Timo Freiesleben

Sebastian Zezulka

154

27 Oct 2025

Data as a Lever: A Neighbouring Datasets Perspective on Predictive Multiplicity

Prakhar Ganesh

Hsiang Hsu

G. Farnadi

148

24 Oct 2025

Identity-Aware Large Language Models require Cultural Reasoning

140

21 Oct 2025

The Digital Mirror: Gender Bias and Occupational Stereotypes in AI-Generated Images

Siiri Leppälampi

Sonja M. Hyrynsalmi

Erno Vanhala

124

08 Oct 2025

RAISE: A Robot-Assisted Selective Disassembly and Sorting System for End-of-Life Phones

Chang Liu

Badrinath Balasubramaniam

109

27 Sep 2025

LABELING COPILOT: A Deep Research Agent for Automated Data Curation in Computer Vision

Debargha Ganguly

Sumit Kumar

Ishwar B Balappanawar

Weicong Chen

Shashank Kambhatla

Srinivasan Iyengar

Shivkumar Kalyanaraman

Ponnurangam Kumaraguru

Vipin Chaudhary

VLM

244

26 Sep 2025

Are You Sure You're Positive? Consolidating Chain-of-Thought Agents with Uncertainty Quantification for Aspect-Category Sentiment Analysis

100

24 Aug 2025

Beyond Internal Data: Bounding and Estimating Fairness from Incomplete Data

167

18 Aug 2025

Advancing Data Equity: Practitioner Responsibility and Accountability in NLP Data Practices

173

13 Aug 2025

Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models

254

05 Aug 2025

Beyond Internal Data: Constructing Complete Datasets for Fairness Testing

195

24 Jul 2025

PARALLELPROMPT: Extracting Parallelism from Large Language Model Queries

211

23 Jun 2025

A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset

450

20 Jun 2025

AI Data Development: A Scorecard for the System Card Framework

Tadesse K. Bahiru

Haileleol Tibebu

Ioannis A. Kakadiaris

228

02 Jun 2025

MObyGaze: a film dataset of multimodal objectification densely annotated by experts

...

177

28 May 2025

We Need to Measure Data Diversity in NLP -- Better and Broader

Dong Nguyen

Esther Ploeger

392

26 May 2025

Social Bias in Popular Question-Answering Benchmarks

Angelie Kraft

Judith Simon

Sonja Schimmler

525

21 May 2025

Deepfakes on Demand: the rise of accessible non-consensual deepfake image generators

Conference on Fairness, Accountability and Transparency (FAccT), 2025

899

06 May 2025

Investigating the Capabilities and Limitations of Machine Learning for Identifying Bias in English Language Data with Information and Heritage Professionals

International Conference on Human Factors in Computing Systems (CHI), 2025

Lucy Havens

Benjamin Bach

Melissa Mhairi Terras

Beatrice Alex

329

01 Apr 2025

Toward an Evaluation Science for Generative AI Systems

453

07 Mar 2025

MONSTER: Monash Scalable Time Series Evaluation Repository

Angus Dempster

Navid Mohammadi Foumani

359

24 Feb 2025

Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation

David Fernandez-Llorca

ELM

831

10 Feb 2025

Large Multimodal Models for Low-Resource Languages: A Survey

491

08 Feb 2025

Authenticated Delegation and Authorized AI Agents

Cedric Deslandes Whitney

Dazza Greenwood

Alan Chan

Alex Pentland

481

17 Jan 2025

Surveying Attitudinal Alignment Between Large Language Models Vs. Humans Towards 17 Sustainable Development Goals

...

420

17 Jan 2025

The Evolution of LLM Adoption in Industry Data Curation Practices

394

20 Dec 2024

The Evolution and Future Perspectives of Artificial Intelligence Generated Content

468

02 Dec 2024

ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models

Computer Vision and Pattern Recognition (CVPR), 2024

1.1K

22 Nov 2024

Value Imprint: A Technique for Auditing the Human Values Embedded in RLHF Datasets

Neural Information Processing Systems (NeurIPS), 2024

Ike Obi

Rohan Pant

Srishti Shekhar Agrawal

Maham Ghazanfar

Aaron Basiletti

309

18 Nov 2024

A Systematic Review of NeurIPS Dataset Management Practices

Neural Information Processing Systems (NeurIPS), 2024

269

31 Oct 2024

Benchmark Data Repositories for Better Benchmarking

Neural Information Processing Systems (NeurIPS), 2024

298

31 Oct 2024

Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms

235

30 Oct 2024

Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models

371

17 Oct 2024

Sound Check: Auditing Audio Datasets

William Agnew

Harry H. Jiang

Sauvik Das

409

17 Oct 2024

Evaluating Cultural Awareness of LLMs for Yoruba, Malayalam, and English

327

14 Sep 2024

Introducing MeMo: A Multimodal Dataset for Memory Modelling in Multiparty Conversations

361

07 Sep 2024

Building Better Datasets: Seven Recommendations for Responsible Design from Dataset Creators

Will Orr

Kate Crawford

253

30 Aug 2024

The Problems with Proxies: Making Data Work Visible through Requester Practices

AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2024

Annabel Rothschild

Ding Wang

Niveditha Jayakumar Vilvanathan

Lauren Wilcox

Carl Disalvo

Betsy Disalvo

227

21 Aug 2024

AI Research is not Magic, it has to be Reproducible and Responsible: Challenges in the AI field from the Perspective of its PhD Students

Andrea Hrckova

Jennifer Renoux

Rafael Tolosana Calasanz

Daniela Chuda

Martin Tamajka

Jakub Simko

147

13 Aug 2024

The Data Addition Dilemma

Machine Learning in Health Care (MLHC), 2024

Judy Hanwen Shen

Inioluwa Deborah Raji

Irene Y. Chen

366

08 Aug 2024

To which reference class do you belong? Measuring racial fairness of reference classes with normative modeling

Christian F. Beckmann

H. Ruhé

A. Marquand

CML

473

26 Jul 2024

Consent in Crisis: The Rapid Decline of the AI Data Commons

...

450

20 Jul 2024

Is That Rain? Understanding Effects on Visual Odometry Performance for Autonomous UAVs and Efficient DNN-based Rain Classification at the Edge

422

17 Jul 2024

Position: Measure Dataset Diversity, Don't Just Claim It

Dora Zhao

Jerone T. A. Andrews

Orestis Papakyriakopoulos

Alice Xiang

336

11 Jul 2024