What Will it Take to Fix Benchmarking in Natural Language Understanding?

North American Chapter of the Association for Computational Linguistics (NAACL), 2021
5 April 2021
Samuel R. Bowman
George E. Dahl
ELM, ALM

Papers citing "What Will it Take to Fix Benchmarking in Natural Language Understanding?"

50 / 125 papers shown
ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
Hongwei Liu
J. Liu
Shudong Liu
Haodong Duan
Yuqiang Li
...
Conghui He
Qi Zhang
Songyang Zhang
Lei Bai
Kai Chen
LRM, ALM, ELM
18 Nov 2025
EvalCards: A Framework for Standardized Evaluation Reporting
Ruchira Dhar
Danae Sanchez Villegas
Antonia Karamolegkou
Alice Schiavone
Yifei Yuan
...
Monorama Swain
Stephanie Brandl
Daniel Hershcovich
Anders Søgaard
Desmond Elliott
05 Nov 2025
Measuring what Matters: Construct Validity in Large Language Model Benchmarks
Andrew M. Bean
Ryan Kearns
Angelika Romanou
Franziska Sofia Hafner
Harry Mayne
...
Christopher Summerfield
Philip Torr
Cozmin Ududec
Luc Rocher
Adam Mahdi
ALM
03 Nov 2025
Reward Models are Metrics in a Trench Coat
Sebastian Gehrmann
03 Oct 2025
MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments
Darshan Deshpande
Varun Gangal
Hersh Mehta
Anand Kannappan
Rebecca Qian
Peng Wang
01 Oct 2025
Uncovering the Computational Ingredients of Human-Like Representations in LLMs
Zach Studdiford
Timothy T. Rogers
Kushin Mukherjee
Siddharth Suresh
01 Oct 2025
KAIO: A Collection of More Challenging Korean Questions
Nahyun Lee
Guijin Son
Hyunwoo Ko
Kyubeen Han
ELM, VLM
18 Sep 2025
Measuring Bias or Measuring the Task: Understanding the Brittle Nature of LLM Gender Biases
Bufan Gao
Elisa Kreiss
04 Sep 2025
MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents
Tomer Wolfson
H. Trivedi
Mor Geva
Yoav Goldberg
Dan Roth
Tushar Khot
Ashish Sabharwal
Reut Tsarfaty
RALM, LRM
15 Aug 2025
STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports
Tegan McCaslin
Jide Alaga
Samira Nedungadi
Seth Donoughe
Tom Reed
Rishi Bommasani
Chris Painter
Luca Righetti
13 Aug 2025
Automated Validation of LLM-based Evaluators for Software Engineering Artifacts
Ora Nova Fandina
E. Farchi
Shmulik Froimovich
Rami Katan
Alice Podolsky
Orna Raz
Avi Ziv
04 Aug 2025
From Queries to Criteria: Understanding How Astronomers Evaluate LLMs
Alina Hyk
Kiera McCormick
Mian Zhong
I. Ciucă
Sanjib Sharma
John F. Wu
J. E. G. Peek
K. Iyer
Ziang Xiao
Anjalie Field
21 Jul 2025
The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason
Shanchao Liang
Spandan Garg
Roshanak Zilouchian Moghaddam
14 Jun 2025
What Has Been Lost with Synthetic Evaluation?
Alexander Gill
Abhilasha Ravichander
Ana Marasović
ELM
28 May 2025
Social Bias in Popular Question-Answering Benchmarks
Angelie Kraft
Judith Simon
Sonja Schimmler
21 May 2025
TRAIL: Trace Reasoning and Agentic Issue Localization
Darshan Deshpande
Varun Gangal
Hersh Mehta
Jitin Krishnan
Anand Kannappan
Rebecca Qian
13 May 2025
FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Jabez Magomere
Elena Kochkina
Samuel Mensah
Simerjot Kaur
Charese Smiley
22 Apr 2025
Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Sky CH-Wang
Darshan Deshpande
Smaranda Muresan
Anand Kannappan
Rebecca Qian
24 Mar 2025
RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code
International Conference on Learning Representations (ICLR), 2025
Dhruv Gautam
Spandan Garg
Jinu Jang
Neel Sundaresan
Roshanak Zilouchian Moghaddam
LLMAG, LRM
10 Mar 2025
Toward an Evaluation Science for Generative AI Systems
Laura Weidinger
Deb Raji
Hanna M. Wallach
Margaret Mitchell
Angelina Wang
Olawale Salaudeen
Rishi Bommasani
Sayash Kapoor
Deep Ganguli
Sanmi Koyejo
EGVM, ELM
07 Mar 2025
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
Rylan Schaeffer
Punit Singh Koura
Binh Tang
R. Subramanian
Aaditya K. Singh
...
Vedanuj Goswami
Sergey Edunov
Dieuwke Hupkes
Sanmi Koyejo
Sharan Narang
ALM
24 Feb 2025
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation
Maria Eriksson
Erasmo Purificato
Arman Noroozian
Joao Vinagre
Guillaume Chaslot
Emilia Gomez
David Fernandez-Llorca
ELM
10 Feb 2025
Towards Effective Discrimination Testing for Generative AI
Conference on Fairness, Accountability and Transparency (FAccT), 2024
Thomas P. Zollo
Nikita Rajaneesh
Richard Zemel
Talia B. Gillis
Emily Black
31 Dec 2024
The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?
Sourav Banerjee
Ayushi Agarwal
Eishkaran Singh
ELM
02 Dec 2024
Benchmark Data Repositories for Better Benchmarking
Neural Information Processing Systems (NeurIPS), 2024
Rachel Longjohn
Markelle Kelly
Sameer Singh
Padhraic Smyth
31 Oct 2024
Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers
Lorenzo Pacchiardi
Marko Tesic
Lucy G. Cheke
José Hernández-Orallo
15 Oct 2024
Responsible AI in Open Ecosystems: Reconciling Innovation with Risk Assessment and Disclosure
Mahasweta Chakraborti
Bert Joseph Prestoza
Nicholas Vincent
Seth Frey
27 Sep 2024
Evaluating AI Evaluation: Perils and Prospects
John Burden
ELM
12 Jul 2024
RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs
Ekaterina Taktasheva
Maxim Bazhukov
Kirill Koncha
Alena Fenogenova
Ekaterina Artemova
Vladislav Mikhailov
27 Jun 2024
Statistical Uncertainty in Word Embeddings: GloVe-V
Andrea Vallebueno
Cassandra Handan-Nader
Christopher D. Manning
Daniel E. Ho
18 Jun 2024
ECBD: Evidence-Centered Benchmark Design for NLP
Yu Lu Liu
Su Lin Blodgett
Jackie Chi Kit Cheung
Q. Vera Liao
Alexandra Olteanu
Ziang Xiao
13 Jun 2024
Making Task-Oriented Dialogue Datasets More Natural by Synthetically Generating Indirect User Requests
Amogh Mannekote
Jinseok Nam
Ziming Li
Jian Gao
K. Boyer
Bonnie J. Dorr
12 Jun 2024
Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation
Gauthier Guinet
Behrooz Omidvar-Tehrani
Hao Ding
Laurent Callot
RALM
22 May 2024
Real Risks of Fake Data: Synthetic Data, Diversity-Washing and Consent Circumvention
Conference on Fairness, Accountability and Transparency (FAccT), 2024
Cedric Deslandes Whitney
Justin Norman
03 May 2024
Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks
International Conference on Machine Learning (ICML), 2024
Guanhua Zhang
Moritz Hardt
02 May 2024
Auxiliary task demands mask the capabilities of smaller language models
Jennifer Hu
Michael C. Frank
ELM
03 Apr 2024
PATCH! Psychometrics-AssisTed BenCHmarking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade Mathematics
Qixiang Fang
Daniel L. Oberski
Dong Nguyen
02 Apr 2024
Dialogue with Robots: Proposals for Broadening Participation and Research in the SLIVAR Community
Casey Kennington
Malihe Alikhani
Heather Pon-Barry
Katherine Atwell
Yonatan Bisk
...
Jivko Sinapov
Angela Stewart
Matthew Stone
Stefanie Tellex
Tom Williams
01 Apr 2024
VariErr NLI: Separating Annotation Error from Human Label Variation
Leon Weber-Genzel
Siyao Peng
M. Marneffe
Barbara Plank
04 Mar 2024
Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress
Christian Schroeder de Witt
Vishaal Udandarao
Juil Sock
Matthias Bethge
Adel Bibi
Samuel Albanie
29 Feb 2024
Verifiable evaluations of machine learning models using zkSNARKs
Tobin South
Alexander Camuto
Shrey Jain
Shayla Nguyen
Robert Mahari
Christian Paquin
Jason Morton
Alex Pentland
MLAU, ALM
05 Feb 2024
Generating Zero-shot Abstractive Explanations for Rumour Verification
I. Bilal
Preslav Nakov
Rob Procter
Maria Liakata
23 Jan 2024
How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Question Generation
Yoo Yeon Sung
Ishani Mondal
Jordan L. Boyd-Graber
20 Jan 2024
Collaboration or Corporate Capture? Quantifying NLP's Reliance on Industry Artifacts and Contributions
Will Aitken
Mohamed Abdalla
K. Rudie
Catherine Stinson
06 Dec 2023
Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation
P. Bricman
01 Dec 2023
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein
Betty Li Hou
Asa Cooper Stickland
Jackson Petty
Richard Yuanzhe Pang
Julien Dirani
Julian Michael
Samuel R. Bowman
AI4MH, ELM
20 Nov 2023
The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation
Ilaria Manco
Benno Weck
Seungheon Doh
Minz Won
Yixiao Zhang
...
Philip Tovstogan
Emmanouil Benetos
Elio Quinton
Gyorgy Fazekas
Juhan Nam
16 Nov 2023
Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness
Ashim Gupta
Rishanth Rajendhran
Nathan Stringham
Vivek Srikumar
Ana Marasović
AAML
16 Nov 2023
Show Your Work with Confidence: Confidence Bands for Tuning Curves
Nicholas Lourie
Kyunghyun Cho
He He
16 Nov 2023
Hallucination Augmented Recitations for Language Models
Abdullatif Köksal
Renat Aksitov
Chung-Ching Chang
HILM
13 Nov 2023