What Will it Take to Fix Benchmarking in Natural Language Understanding?
Samuel R. Bowman, George E. Dahl
arXiv:2104.02145 · 5 April 2021 · Tags: ELM, ALM

Papers citing "What Will it Take to Fix Benchmarking in Natural Language Understanding?" (32 of 32 papers shown)

TRAIL: Trace Reasoning and Agentic Issue Localization
Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, Rebecca Qian
13 May 2025 · 20 / 0 / 0

Toward an Evaluation Science for Generative AI Systems
Laura Weidinger, Deb Raji, Hanna M. Wallach, Margaret Mitchell, Angelina Wang, Olawale Salaudeen, Rishi Bommasani, Sayash Kapoor, Deep Ganguli, Sanmi Koyejo
07 Mar 2025 · Tags: EGVM, ELM · 62 / 3 / 0

Towards Effective Discrimination Testing for Generative AI
Thomas P. Zollo, Nikita Rajaneesh, Richard Zemel, Talia B. Gillis, Emily Black
31 Dec 2024 · 30 / 1 / 0

Dialogue with Robots: Proposals for Broadening Participation and Research in the SLIVAR Community
Casey Kennington, Malihe Alikhani, Heather Pon-Barry, Katherine Atwell, Yonatan Bisk, ..., Jivko Sinapov, Angela Stewart, Matthew Stone, Stefanie Tellex, Tom Williams
01 Apr 2024 · 49 / 0 / 0

Weisfeiler and Leman Go Measurement Modeling: Probing the Validity of the WL Test
Arjun Subramonian, Adina Williams, Maximilian Nickel, Yizhou Sun, Levent Sagun
11 Jul 2023 · 16 / 0 / 0

No Strong Feelings One Way or Another: Re-operationalizing Neutrality in Natural Language Inference
Animesh Nighojkar, Antonio Laverghetta Jr., John Licato
16 Jun 2023 · 23 / 4 / 0

Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations
Lifan Yuan, Yangyi Chen, Ganqu Cui, Hongcheng Gao, Fangyuan Zou, Xingyi Cheng, Heng Ji, Zhiyuan Liu, Maosong Sun
07 Jun 2023 · 32 / 72 / 0

Language acquisition: do children and language models follow similar learning stages?
Linnea Evanson, Yair Lakretz, J. King
06 Jun 2023 · 9 / 26 / 0

PaLM 2 Technical Report
Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, ..., Ce Zheng, Wei Zhou, Denny Zhou, Slav Petrov, Yonghui Wu
17 May 2023 · Tags: ReLM, LRM · 58 / 1,138 / 0

Design Choices for Crowdsourcing Implicit Discourse Relations: Revealing the Biases Introduced by Task Design
Valentina Pyatkin, Frances Yung, Merel C. J. Scholman, Reut Tsarfaty, Ido Dagan, Vera Demberg
03 Apr 2023 · 19 / 12 / 0

Evaluation for Change
Rishi Bommasani
20 Dec 2022 · Tags: ELM · 30 / 0 / 0

Validating Large Language Models with ReLM
Michael Kuchnik, Virginia Smith, George Amvrosiadis
21 Nov 2022 · 8 / 27 / 0

TestAug: A Framework for Augmenting Capability-based NLP Tests
Guanqun Yang, Mirazul Haque, Qiaochu Song, Wei Yang, Xueqing Liu
14 Oct 2022 · Tags: ELM · 23 / 0 / 0
Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements
Leandro von Werra
Lewis Tunstall
A. Thakur
A. Luccioni
Tristan Thrush
...
Julien Chaumond
Margaret Mitchell
Alexander M. Rush
Thomas Wolf
Douwe Kiela
ELM
14
24
0
30 Sep 2022

Making Intelligence: Ethical Values in IQ and ML Benchmarks
Borhane Blili-Hamelin, Leif Hancox-Li
01 Sep 2022 · 17 / 16 / 0

KGxBoard: Explainable and Interactive Leaderboard for Evaluation of Knowledge Graph Completion Models
Haris Widjaja, Kiril Gashteovski, Wiem Ben-Rim, Pengfei Liu, Christopher Malon, Daniel Ruffinelli, Carolin (Haas) Lawrence, Graham Neubig
23 Aug 2022 · 17 / 5 / 0

ShortcutLens: A Visual Analytics Approach for Exploring Shortcuts in Natural Language Understanding Dataset
Zhihua Jin, Xingbo Wang, Furui Cheng, Chunhui Sun, Qun Liu, Huamin Qu
17 Aug 2022 · 32 / 9 / 0

FairLex: A Multilingual Benchmark for Evaluating Fairness in Legal Text Processing
Ilias Chalkidis, Tommaso Pasini, Sheng Zhang, Letizia Tomada, Sebastian Felix Schwemer, Anders Søgaard
14 Mar 2022 · Tags: AILaw · 27 / 54 / 0

What Makes Reading Comprehension Questions Difficult?
Saku Sugawara, Nikita Nangia, Alex Warstadt, Sam Bowman
12 Mar 2022 · Tags: ELM, RALM · 12 / 13 / 0

Mapping global dynamics of benchmark creation and saturation in artificial intelligence
Simon Ott, A. Barbosa-Silva, Kathrin Blagec, J. Brauner, Matthias Samwald
09 Mar 2022 · 22 / 36 / 0

Language technology practitioners as language managers: arbitrating data bias and predictive bias in ASR
Nina Markl, S. McNulty
25 Feb 2022 · 12 / 9 / 0

Trust in AI: Interpretability is not necessary or sufficient, while black-box interaction is necessary and sufficient
Max W. Shen
10 Feb 2022 · 17 / 18 / 0

COPA-SSE: Semi-structured Explanations for Commonsense Reasoning
Ana Brassard, Benjamin Heinzerling, Pride Kavumba, Kentaro Inui
18 Jan 2022 · Tags: FAtt, LRM · 13 / 10 / 0

WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
Alisa Liu, Swabha Swayamdipta, Noah A. Smith, Yejin Choi
16 Jan 2022 · 30 / 212 / 0

Models in the Loop: Aiding Crowdworkers with Generative Annotation Assistants
Max Bartolo, Tristan Thrush, Sebastian Riedel, Pontus Stenetorp, Robin Jia, Douwe Kiela
16 Dec 2021 · 17 / 33 / 0

QuALITY: Question Answering with Long Input Texts, Yes!
Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, ..., Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, Sam Bowman
16 Dec 2021 · Tags: RALM · 25 / 141 / 0

Scaling Up Influence Functions
Andrea Schioppa, Polina Zablotskaia, David Vilar, Artem Sokolov
06 Dec 2021 · Tags: TDI · 19 / 90 / 0

Dyna-bAbI: unlocking bAbI's potential with dynamic synthetic benchmarking
Ronen Tamari, Kyle Richardson, Aviad Sar-Shalom, Noam Kahlon, Nelson F. Liu, Reut Tsarfaty, Dafna Shahaf
30 Nov 2021 · 28 / 5 / 0

CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge
Yasumasa Onoe, Michael J.Q. Zhang, Eunsol Choi, Greg Durrett
03 Sep 2021 · Tags: HILM · 21 / 85 / 0

Do Natural Language Explanations Represent Valid Logical Arguments? Verifying Entailment in Explainable NLI Gold Standards
Marco Valentino, Ian Pratt-Hartmann, André Freitas
05 May 2021 · Tags: XAI, LRM · 21 / 12 / 0

With Little Power Comes Great Responsibility
Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, Dan Jurafsky
13 Oct 2020 · 225 / 115 / 0

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman
20 Apr 2018 · Tags: ELM · 294 / 6,943 / 0