ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2104.14337
  4. Cited By
Dynabench: Rethinking Benchmarking in NLP

Dynabench: Rethinking Benchmarking in NLP

7 April 2021
Douwe Kiela
Max Bartolo
Yixin Nie
Divyansh Kaushik
Atticus Geiger
Zhengxuan Wu
Bertie Vidgen
Grusha Prasad
Amanpreet Singh
Pratik Ringshia
Zhiyi Ma
Tristan Thrush
Sebastian Riedel
Zeerak Talat
Pontus Stenetorp
Robin Jia
Mohit Bansal
Christopher Potts
Adina Williams
ArXivPDFHTML

Papers citing "Dynabench: Rethinking Benchmarking in NLP"

50 / 93 papers shown
Title
What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
Michael A. Hedderich
Anyi Wang
Raoyuan Zhao
Florian Eichin
Barbara Plank
30
0
0
22 Apr 2025
Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark
Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark
Jasper Götting
Pedro Medeiros
Jon G Sanders
Nathaniel Li
Long Phan
Karam Elabd
Lennart Justen
Dan Hendrycks
Seth Donoughe
ELM
49
2
0
21 Apr 2025
Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Alex Warstadt
Aaron Mueller
Leshem Choshen
E. Wilcox
Chengxu Zhuang
...
Rafael Mosquera
Bhargavi Paranjape
Adina Williams
Tal Linzen
Ryan Cotterell
38
107
0
10 Apr 2025
ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition
ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition
H. A. Alyahya
Haidar Khan
Yazeed Alnumay
M Saiful Bari
B. Yener
LRM
65
1
0
10 Mar 2025
Toward an Evaluation Science for Generative AI Systems
Laura Weidinger
Deb Raji
Hanna M. Wallach
Margaret Mitchell
Angelina Wang
Olawale Salaudeen
Rishi Bommasani
Sayash Kapoor
Deep Ganguli
Sanmi Koyejo
EGVM
ELM
65
4
0
07 Mar 2025
Position: Model Collapse Does Not Mean What You Think
Position: Model Collapse Does Not Mean What You Think
Rylan Schaeffer
Joshua Kazdan
Alvan Caleb Arulandu
Sanmi Koyejo
60
0
0
05 Mar 2025
Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models
Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models
Seonil Son
Ju-Min Oh
Heegon Jin
Cheolhun Jang
Jeongbeom Jeong
Kuntae Kim
44
0
0
20 Feb 2025
MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
Andreas Opedal
Haruki Shirakami
Bernhard Schölkopf
Abulhair Saparov
Mrinmaya Sachan
LRM
57
1
0
17 Feb 2025
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Yulei Qin
Yuncheng Yang
Pengcheng Guo
Gang Li
Hang Shao
Yuchen Shi
Zihan Xu
Yun Gu
Ke Li
Xing Sun
ALM
90
12
0
31 Dec 2024
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Davide Paglieri
Bartłomiej Cupiał
Samuel Coward
Ulyana Piterbarg
Maciej Wolczyk
...
Lerrel Pinto
Rob Fergus
Jakob Foerster
Jack Parker-Holder
Tim Rocktaschel
LLMAG
LRM
108
10
0
20 Nov 2024
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Vipul Gupta
Candace Ross
David Pantoja
R. Passonneau
Megan Ung
Adina Williams
70
1
0
26 Oct 2024
Enabling Scalable Evaluation of Bias Patterns in Medical LLMs
Enabling Scalable Evaluation of Bias Patterns in Medical LLMs
Hamed Fayyaz
Raphael Poulain
Rahmatollah Beheshti
32
1
0
18 Oct 2024
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Baiqi Li
Zhiqiu Lin
Wenxuan Peng
Jean de Dieu Nyandwi
Daniel Jiang
Zixian Ma
Simran Khanuja
Ranjay Krishna
Graham Neubig
Deva Ramanan
AAML
CoGe
VLM
69
21
0
18 Oct 2024
MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems
MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems
Nandan Thakur
Suleman Kazi
Ge Luo
Jimmy J. Lin
Amin Ahmad
VLM
RALM
28
7
0
17 Oct 2024
SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic
  CheckLists
SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists
Raoyuan Zhao
Abdullatif Köksal
Yihong Liu
Leonie Weissweiler
Anna Korhonen
Hinrich Schütze
SyDa
36
1
0
30 Aug 2024
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
Zhikai Li
Xuewen Liu
Dongrong Fu
Jianquan Li
Qingyi Gu
Kurt Keutzer
Zhen Dong
EGVM
VGen
DiffM
83
1
0
26 Aug 2024
Extrinsic Evaluation of Cultural Competence in Large Language Models
Extrinsic Evaluation of Cultural Competence in Large Language Models
Shaily Bhatt
Fernando Diaz
ELM
EGVM
47
4
0
17 Jun 2024
HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial
  Actions across X Community Notes and Wikipedia edits
HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits
Tim Franzmeyer
Aleksandar Shtedritski
Samuel Albanie
Philip H. S. Torr
João F. Henriques
Jakob N. Foerster
27
1
0
05 Jun 2024
What is it for a Machine Learning Model to Have a Capability?
What is it for a Machine Learning Model to Have a Capability?
Jacqueline Harding
Nathaniel Sharadin
ELM
36
3
0
14 May 2024
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse
  Harms in Text-to-Image Generation
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation
Jessica Quaye
Alicia Parrish
Oana Inel
Charvi Rastogi
Hannah Rose Kirk
...
Nathan Clement
Rafael Mosquera
Juan Ciro
Vijay Janapa Reddi
Lora Aroyo
31
7
0
14 Feb 2024
How the Advent of Ubiquitous Large Language Models both Stymie and
  Turbocharge Dynamic Adversarial Question Generation
How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Question Generation
Yoo Yeon Sung
Ishani Mondal
Jordan L. Boyd-Graber
28
0
0
20 Jan 2024
Deep Emotions Across Languages: A Novel Approach for Sentiment
  Propagation in Multilingual WordNets
Deep Emotions Across Languages: A Novel Approach for Sentiment Propagation in Multilingual WordNets
Jan Kocoñ
22
5
0
07 Dec 2023
Latent Feature-based Data Splits to Improve Generalisation Evaluation: A
  Hate Speech Detection Case Study
Latent Feature-based Data Splits to Improve Generalisation Evaluation: A Hate Speech Detection Case Study
Maike Zufle
Verna Dankers
Ivan Titov
31
0
0
16 Nov 2023
Toward Stronger Textual Attack Detectors
Toward Stronger Textual Attack Detectors
Pierre Colombo
Marine Picot
Nathan Noiry
Guillaume Staerman
Pablo Piantanida
38
5
0
21 Oct 2023
Anchor Points: Benchmarking Models with Much Fewer Examples
Anchor Points: Benchmarking Models with Much Fewer Examples
Rajan Vivek
Kawin Ethayarajh
Diyi Yang
Douwe Kiela
ALM
29
21
0
14 Sep 2023
No Train No Gain: Revisiting Efficient Training Algorithms For
  Transformer-based Language Models
No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models
Jean Kaddour
Oscar Key
Piotr Nawrot
Pasquale Minervini
Matt J. Kusner
17
41
0
12 Jul 2023
Which Spurious Correlations Impact Reasoning in NLI Models? A Visual
  Interactive Diagnosis through Data-Constrained Counterfactuals
Which Spurious Correlations Impact Reasoning in NLI Models? A Visual Interactive Diagnosis through Data-Constrained Counterfactuals
Robin Shing Moon Chan
Afra Amini
Mennatallah El-Assady
LRM
AAML
29
2
0
21 Jun 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric P. Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM
OSLM
ELM
16
3,781
0
09 Jun 2023
From Adversarial Arms Race to Model-centric Evaluation: Motivating a
  Unified Automatic Robustness Evaluation Framework
From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework
Yangyi Chen
Hongcheng Gao
Ganqu Cui
Lifan Yuan
Dehan Kong
...
Longtao Huang
H. Xue
Zhiyuan Liu
Maosong Sun
Heng Ji
AAML
ELM
25
6
0
29 May 2023
Collaborative Development of NLP models
Collaborative Development of NLP models
Fereshte Khani
Marco Tulio Ribeiro
25
2
0
20 May 2023
What's the Meaning of Superhuman Performance in Today's NLU?
What's the Meaning of Superhuman Performance in Today's NLU?
Simone Tedeschi
Johan Bos
T. Declerck
Jan Hajic
Daniel Hershcovich
...
Simon Krek
Steven Schockaert
Rico Sennrich
Ekaterina Shutova
Roberto Navigli
ELM
LM&MA
VLM
ReLM
LRM
34
26
0
15 May 2023
LINGO : Visually Debiasing Natural Language Instructions to Support Task
  Diversity
LINGO : Visually Debiasing Natural Language Instructions to Support Task Diversity
Anjana Arunkumar
Shubham Sharma
Rakhi Agrawal
Sriramakrishnan Chandrasekaran
Chris Bryan
26
0
0
12 Apr 2023
Angler: Helping Machine Translation Practitioners Prioritize Model
  Improvements
Angler: Helping Machine Translation Practitioners Prioritize Model Improvements
Samantha Robertson
Zijie J. Wang
Dominik Moritz
Mary Beth Kery
Fred Hohman
30
15
0
12 Apr 2023
The Vector Grounding Problem
The Vector Grounding Problem
Dimitri Coelho Mollo
Raphael Milliere
36
26
0
04 Apr 2023
SemEval-2023 Task 10: Explainable Detection of Online Sexism
SemEval-2023 Task 10: Explainable Detection of Online Sexism
Hannah Rose Kirk
Wenjie Yin
Bertie Vidgen
Paul Röttger
10
117
0
07 Mar 2023
Auditing large language models: a three-layered approach
Auditing large language models: a three-layered approach
Jakob Mokander
Jonas Schuett
Hannah Rose Kirk
Luciano Floridi
AILaw
MLAU
42
194
0
16 Feb 2023
Backdoor Learning for NLP: Recent Advances, Challenges, and Future
  Research Directions
Backdoor Learning for NLP: Recent Advances, Challenges, and Future Research Directions
Marwan Omar
SILM
AAML
25
20
0
14 Feb 2023
Evaluation for Change
Evaluation for Change
Rishi Bommasani
ELM
38
0
0
20 Dec 2022
Evaluating Human-Language Model Interaction
Evaluating Human-Language Model Interaction
Mina Lee
Megha Srivastava
Amelia Hardy
John Thickstun
Esin Durmus
...
Hancheng Cao
Tony Lee
Rishi Bommasani
Michael S. Bernstein
Percy Liang
LM&MA
ALM
46
99
0
19 Dec 2022
Unnatural Instructions: Tuning Language Models with (Almost) No Human
  Labor
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
Or Honovich
Thomas Scialom
Omer Levy
Timo Schick
ALM
31
359
0
19 Dec 2022
CiteBench: A benchmark for Scientific Citation Text Generation
CiteBench: A benchmark for Scientific Citation Text Generation
Martin Funkquist
Ilia Kuznetsov
Yufang Hou
Iryna Gurevych
15
16
0
19 Dec 2022
Graph Learning Indexer: A Contributor-Friendly and Metadata-Rich
  Platform for Graph Learning Benchmarks
Graph Learning Indexer: A Contributor-Friendly and Metadata-Rich Platform for Graph Learning Benchmarks
Jiaqi Ma
Xingjian Zhang
Hezheng Fan
Jin Huang
Tianyue Li
Tinghong Li
Yiwen Tu
Chen Zhu
Qiaozhu Mei
35
5
0
08 Dec 2022
Validating Large Language Models with ReLM
Validating Large Language Models with ReLM
Michael Kuchnik
Virginia Smith
George Amvrosiadis
21
27
0
21 Nov 2022
GLUE-X: Evaluating Natural Language Understanding Models from an
  Out-of-distribution Generalization Perspective
GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective
Linyi Yang
Shuibai Zhang
Libo Qin
Yafu Li
Yidong Wang
Hanmeng Liu
Jindong Wang
Xingxu Xie
Yue Zhang
ELM
39
79
0
15 Nov 2022
NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as
  Artificial Adversaries?
NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as Artificial Adversaries?
Saadia Gabriel
Hamid Palangi
Yejin Choi
AAML
37
1
0
08 Nov 2022
TCAB: A Large-Scale Text Classification Attack Benchmark
TCAB: A Large-Scale Text Classification Attack Benchmark
Kalyani Asthana
Zhouhang Xie
Wencong You
Adam Noack
Jonathan Brophy
Sameer Singh
Daniel Lowd
24
3
0
21 Oct 2022
Data-Efficient Strategies for Expanding Hate Speech Detection into
  Under-Resourced Languages
Data-Efficient Strategies for Expanding Hate Speech Detection into Under-Resourced Languages
Paul Röttger
Debora Nozza
Federico Bianchi
Dirk Hovy
23
10
0
20 Oct 2022
Understanding Practices, Challenges, and Opportunities for User-Engaged
  Algorithm Auditing in Industry Practice
Understanding Practices, Challenges, and Opportunities for User-Engaged Algorithm Auditing in Industry Practice
Wesley Hanwen Deng
B. Guo
Alicia DeVrio
Hong Shen
Motahhare Eslami
Kenneth Holstein
MLAU
17
58
0
07 Oct 2022
InferES : A Natural Language Inference Corpus for Spanish Featuring
  Negation-Based Contrastive and Adversarial Examples
InferES : A Natural Language Inference Corpus for Spanish Featuring Negation-Based Contrastive and Adversarial Examples
Venelin Kovatchev
Mariona Taulé
25
4
0
06 Oct 2022
State-of-the-art generalisation research in NLP: A taxonomy and review
State-of-the-art generalisation research in NLP: A taxonomy and review
Dieuwke Hupkes
Mario Giulianelli
Verna Dankers
Mikel Artetxe
Yanai Elazar
...
Leila Khalatbari
Maria Ryskina
Rita Frieske
Ryan Cotterell
Zhijing Jin
114
93
0
06 Oct 2022
12
Next