AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

Tasnim Kabir
Dmytro Kurdydyk
Aadi Palnitkar
Liam Dorn
Ahmed Haj Ahmed
Jordan Lee Boyd-Graber
Main: 8 pages · 4 figures · 20 tables · Bibliography: 6 pages · Appendix: 16 pages
Abstract

Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies: short-duration cues, lexical priors, dataset-specific biases, or even bypassing the audio entirely via metadata and captions, rather than genuine reasoning. We therefore present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark that rigorously evaluates audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors, long-range temporal dependencies, and probing queries that cannot be answered from isolated text or sound cues alone. An average human accuracy of 32.13% shows that the task is challenging while still demonstrating meaningful comprehension of the audio. In stark contrast, state-of-the-art audio question answering models perform poorly, with average accuracy below 8.86%. Beyond raw accuracy, we apply Item Response Theory (IRT) to estimate latent proficiency and question difficulty, exposing systematic deficiencies of both the models and the data.
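The abstract does not specify which IRT variant is fit; as an illustrative sketch, a common choice is the two-parameter logistic (2PL) model, in which the probability that respondent $i$ (human or model) answers question $j$ correctly depends on latent proficiency $\theta_i$, item difficulty $b_j$, and item discrimination $a_j$:

$$P(y_{ij} = 1 \mid \theta_i, a_j, b_j) = \frac{1}{1 + \exp\bigl(-a_j(\theta_i - b_j)\bigr)}$$

Fitting such a model to the full response matrix puts humans and models on a shared proficiency scale and yields per-question difficulty estimates, which is what lets the authors localize where models fail rather than reporting accuracy alone.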
