What Will it Take to Fix Benchmarking in Natural Language Understanding?

North American Chapter of the Association for Computational Linguistics (NAACL), 2021
5 April 2021
Samuel R. Bowman
George E. Dahl
ELM, ALM

Papers citing "What Will it Take to Fix Benchmarking in Natural Language Understanding?"

50 / 125 papers shown
ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
Hongwei Liu
J. Liu
Shudong Liu
Haodong Duan
Yuqiang Li
...
Conghui He
Qi Zhang
Songyang Zhang
Lei Bai
Kai Chen
LRM, ALM, ELM
18 Nov 2025
EvalCards: A Framework for Standardized Evaluation Reporting
Ruchira Dhar
Danae Sanchez Villegas
Antonia Karamolegkou
Alice Schiavone
Yifei Yuan
...
Monorama Swain
Stephanie Brandl
Daniel Hershcovich
Anders Søgaard
Desmond Elliott
05 Nov 2025
Measuring what Matters: Construct Validity in Large Language Model Benchmarks
Andrew M. Bean
Ryan Kearns
Angelika Romanou
Franziska Sofia Hafner
Harry Mayne
...
Christopher Summerfield
Philip Torr
Cozmin Ududec
Luc Rocher
Adam Mahdi
ALM
03 Nov 2025
Reward Models are Metrics in a Trench Coat
Sebastian Gehrmann
03 Oct 2025
MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments
Darshan Deshpande
Varun Gangal
Hersh Mehta
Anand Kannappan
Rebecca Qian
Peng Wang
01 Oct 2025
Uncovering the Computational Ingredients of Human-Like Representations in LLMs
Zach Studdiford
Timothy T. Rogers
Kushin Mukherjee
Siddharth Suresh
01 Oct 2025
KAIO: A Collection of More Challenging Korean Questions
Nahyun Lee
Guijin Son
Hyunwoo Ko
Kyubeen Han
ELM, VLM
18 Sep 2025
Measuring Bias or Measuring the Task: Understanding the Brittle Nature of LLM Gender Biases
Bufan Gao
Elisa Kreiss
04 Sep 2025
MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents
Tomer Wolfson
H. Trivedi
Mor Geva
Yoav Goldberg
Dan Roth
Tushar Khot
Ashish Sabharwal
Reut Tsarfaty
RALM, LRM
15 Aug 2025
STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports
Tegan McCaslin
Jide Alaga
Samira Nedungadi
Seth Donoughe
Tom Reed
Rishi Bommasani
Chris Painter
Luca Righetti
13 Aug 2025
Automated Validation of LLM-based Evaluators for Software Engineering Artifacts
Ora Nova Fandina
E. Farchi
Shmulik Froimovich
Rami Katan
Alice Podolsky
Orna Raz
Avi Ziv
04 Aug 2025
From Queries to Criteria: Understanding How Astronomers Evaluate LLMs
Alina Hyk
Kiera McCormick
Mian Zhong
I. Ciucă
Sanjib Sharma
John F. Wu
J. E. G. Peek
K. Iyer
Ziang Xiao
Anjalie Field
21 Jul 2025
The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason
Shanchao Liang
Spandan Garg
Roshanak Zilouchian Moghaddam
14 Jun 2025
What Has Been Lost with Synthetic Evaluation?
Alexander Gill
Abhilasha Ravichander
Ana Marasović
ELM
28 May 2025
Social Bias in Popular Question-Answering Benchmarks
Angelie Kraft
Judith Simon
Sonja Schimmler
21 May 2025
TRAIL: Trace Reasoning and Agentic Issue Localization
Darshan Deshpande
Varun Gangal
Hersh Mehta
Jitin Krishnan
Anand Kannappan
Rebecca Qian
13 May 2025
FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Jabez Magomere
Elena Kochkina
Samuel Mensah
Simerjot Kaur
Charese Smiley
22 Apr 2025
Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Sky CH-Wang
Darshan Deshpande
Smaranda Muresan
Anand Kannappan
Rebecca Qian
24 Mar 2025
RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code
International Conference on Learning Representations (ICLR), 2025
Dhruv Gautam
Spandan Garg
Jinu Jang
Neel Sundaresan
Roshanak Zilouchian Moghaddam
LLMAG, LRM
10 Mar 2025
Toward an Evaluation Science for Generative AI Systems
Laura Weidinger
Deb Raji
Hanna M. Wallach
Margaret Mitchell
Angelina Wang
Olawale Salaudeen
Rishi Bommasani
Sayash Kapoor
Deep Ganguli
Sanmi Koyejo
EGVM, ELM
07 Mar 2025
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
Rylan Schaeffer
Punit Singh Koura
Binh Tang
R. Subramanian
Aaditya K. Singh
...
Vedanuj Goswami
Sergey Edunov
Dieuwke Hupkes
Sanmi Koyejo
Sharan Narang
ALM
24 Feb 2025
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation
Maria Eriksson
Erasmo Purificato
Arman Noroozian
Joao Vinagre
Guillaume Chaslot
Emilia Gomez
David Fernandez-Llorca
ELM
10 Feb 2025
Towards Effective Discrimination Testing for Generative AI
Conference on Fairness, Accountability and Transparency (FAccT), 2024
Thomas P. Zollo
Nikita Rajaneesh
Richard Zemel
Talia B. Gillis
Emily Black
31 Dec 2024
The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?
Sourav Banerjee
Ayushi Agarwal
Eishkaran Singh
ELM
02 Dec 2024
Benchmark Data Repositories for Better Benchmarking
Neural Information Processing Systems (NeurIPS), 2024
Rachel Longjohn
Markelle Kelly
Sameer Singh
Padhraic Smyth
31 Oct 2024
Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers
Lorenzo Pacchiardi
Marko Tesic
Lucy G. Cheke
José Hernández-Orallo
15 Oct 2024
Responsible AI in Open Ecosystems: Reconciling Innovation with Risk Assessment and Disclosure
Mahasweta Chakraborti
Bert Joseph Prestoza
Nicholas Vincent
Seth Frey
27 Sep 2024
Evaluating AI Evaluation: Perils and Prospects
John Burden
ELM
12 Jul 2024
RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs
Ekaterina Taktasheva
Maxim Bazhukov
Kirill Koncha
Alena Fenogenova
Ekaterina Artemova
Vladislav Mikhailov
27 Jun 2024
Statistical Uncertainty in Word Embeddings: GloVe-V
Andrea Vallebueno
Cassandra Handan-Nader
Christopher D. Manning
Daniel E. Ho
18 Jun 2024
ECBD: Evidence-Centered Benchmark Design for NLP
Yu Lu Liu
Su Lin Blodgett
Jackie Chi Kit Cheung
Q. Vera Liao
Alexandra Olteanu
Ziang Xiao
13 Jun 2024
Making Task-Oriented Dialogue Datasets More Natural by Synthetically Generating Indirect User Requests
Amogh Mannekote
Jinseok Nam
Ziming Li
Jian Gao
K. Boyer
Bonnie J. Dorr
12 Jun 2024
Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation
Gauthier Guinet
Behrooz Omidvar-Tehrani
Hao Ding
Laurent Callot
RALM
22 May 2024
Real Risks of Fake Data: Synthetic Data, Diversity-Washing and Consent Circumvention
Conference on Fairness, Accountability and Transparency (FAccT), 2024
Cedric Deslandes Whitney
Justin Norman
03 May 2024
Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks
International Conference on Machine Learning (ICML), 2024
Guanhua Zhang
Moritz Hardt
02 May 2024
Auxiliary task demands mask the capabilities of smaller language models
Jennifer Hu
Michael C. Frank
ELM
03 Apr 2024
PATCH! Psychometrics-AssisTed BenCHmarking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade Mathematics
Qixiang Fang
Daniel L. Oberski
Dong Nguyen
02 Apr 2024
Dialogue with Robots: Proposals for Broadening Participation and Research in the SLIVAR Community
Casey Kennington
Malihe Alikhani
Heather Pon-Barry
Katherine Atwell
Yonatan Bisk
...
Jivko Sinapov
Angela Stewart
Matthew Stone
Stefanie Tellex
Tom Williams
01 Apr 2024
VariErr NLI: Separating Annotation Error from Human Label Variation
Leon Weber-Genzel
Siyao Peng
M. Marneffe
Barbara Plank
04 Mar 2024
Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress
Christian Schroeder de Witt
Vishaal Udandarao
Juil Sock
Matthias Bethge
Adel Bibi
Samuel Albanie
29 Feb 2024
Verifiable evaluations of machine learning models using zkSNARKs
Tobin South
Alexander Camuto
Shrey Jain
Shayla Nguyen
Robert Mahari
Christian Paquin
Jason Morton
Alex Pentland
MLAU, ALM
05 Feb 2024
Generating Zero-shot Abstractive Explanations for Rumour Verification
I. Bilal
Preslav Nakov
Rob Procter
Maria Liakata
23 Jan 2024
How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Question Generation
Yoo Yeon Sung
Ishani Mondal
Jordan L. Boyd-Graber
20 Jan 2024
Collaboration or Corporate Capture? Quantifying NLP's Reliance on Industry Artifacts and Contributions
Will Aitken
Mohamed Abdalla
K. Rudie
Catherine Stinson
06 Dec 2023
Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation
P. Bricman
01 Dec 2023
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein
Betty Li Hou
Asa Cooper Stickland
Jackson Petty
Richard Yuanzhe Pang
Julien Dirani
Julian Michael
Samuel R. Bowman
AI4MH, ELM
20 Nov 2023
The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation
Ilaria Manco
Benno Weck
Seungheon Doh
Minz Won
Yixiao Zhang
...
Philip Tovstogan
Emmanouil Benetos
Elio Quinton
Gyorgy Fazekas
Juhan Nam
16 Nov 2023
Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness
Ashim Gupta
Rishanth Rajendhran
Nathan Stringham
Vivek Srikumar
Ana Marasović
AAML
16 Nov 2023
Show Your Work with Confidence: Confidence Bands for Tuning Curves
Nicholas Lourie
Kyunghyun Cho
He He
16 Nov 2023
Hallucination Augmented Recitations for Language Models
Abdullatif Köksal
Renat Aksitov
Chung-Ching Chang
HILM
13 Nov 2023