arXiv: 2212.10020
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
20 December 2022
Tianxing He, Jingyu Zhang, Tianle Wang, Sachin Kumar, Kyunghyun Cho, James R. Glass, Yulia Tsvetkov
Papers citing "On the Blind Spots of Model-Based Evaluation Metrics for Text Generation" (42 of 42 papers shown):
- LLM-based NLG Evaluation: Current Status and Challenges. Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, Xiaojun Wan. 15 May 2025. [ELM, LM&MA]
- Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy. Athiya Deviyani, Fernando Diaz. 25 Mar 2025.
- Can one size fit all?: Measuring Failure in Multi-Document Summarization Domain Transfer. Alexandra DeLucia, Mark Dredze. 20 Mar 2025.
- What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation. Dingyi Yang, Qin Jin. 26 Aug 2024.
- Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices. Patrícia Schmidtová, Saad Mahamood, Simone Balloccu, Ondřej Dušek, Albert Gatt, Dimitra Gkatzia, David M. Howcroft, Ondřej Plátek, Adarsa Sivaprasad. 17 Aug 2024.
- Diffusion Guided Language Modeling. Justin Lovelace, Varsha Kishore, Yiwei Chen, Kilian Q. Weinberger. 08 Aug 2024.
- Finding Blind Spots in Evaluator LLMs with Interpretable Checklists. Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, Mitesh Khapra. 19 Jun 2024.
- Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges. Jonas Becker, Jan Philip Wahle, Bela Gipp, Terry Ruas. 24 May 2024.
- Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data. Jingyu Zhang, Marc Marone, Tianjian Li, Benjamin Van Durme, Daniel Khashabi. 05 Apr 2024.
- SELF-[IN]CORRECT: LLMs Struggle with Refining Self-Generated Responses. Dongwei Jiang, Jingyu Zhang, Orion Weller, Nathaniel Weir, Benjamin Van Durme, Daniel Khashabi. 04 Apr 2024.
- MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues. Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, ..., Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, Wanli Ouyang. 22 Feb 2024. [ELM, LM&MA]
- Are LLM-based Evaluators Confusing NLG Quality Criteria? Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, Xiaojun Wan. 19 Feb 2024. [AAML, ELM]
- Stumbling Blocks: Stress Testing the Robustness of Machine-Generated Text Detectors Under Attacks. Yichen Wang, Shangbin Feng, Abe Bohan Hou, Xiao Pu, Chao Shen, Xiaoming Liu, Yulia Tsvetkov, Tianxing He. 18 Feb 2024. [DeLMO]
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation. Pei Ke, Bosi Wen, Andrew Feng, Xiao-Yang Liu, Xuanyu Lei, ..., Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang. 30 Nov 2023. [ELM, ALM]
- LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores. Yiqi Liu, N. Moosavi, Chenghua Lin. 16 Nov 2023. [ELM]
- The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics. Christoph Leiter, Juri Opitz, Daniel Deutsch, Yang Gao, Rotem Dror, Steffen Eger. 30 Oct 2023. [ALM, LRM, ELM]
- BLESS: Benchmarking Large Language Models on Sentence Simplification. Tannon Kew, Alison Chi, Laura Vásquez-Rodríguez, Sweta Agrawal, Dennis Aumiller, Fernando Alva-Manchego, Teven Le Scao. 24 Oct 2023.
- ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning. Hosein Hasanbeig, Hiteshi Sharma, Leo Betthauser, Felipe Vieira Frujeri, Ida Momennejad. 24 Sep 2023.
- HANS, are you clever? Clever Hans Effect Analysis of Neural Systems. Leonardo Ranaldi, Fabio Massimo Zanzotto. 21 Sep 2023.
- Uncertainty in Natural Language Generation: From Theory to Applications. Joris Baan, Nico Daheim, Evgenia Ilia, Dennis Ulmer, Haau-Sing Li, Raquel Fernández, Barbara Plank, Rico Sennrich, Chrysoula Zerva, Wilker Aziz. 28 Jul 2023. [UQLM]
- Understanding In-Context Learning via Supportive Pretraining Data. Xiaochuang Han, Daniel Simig, Todor Mihaylov, Yulia Tsvetkov, Asli Celikyilmaz, Tianlu Wang. 26 Jun 2023. [AIMat]
- Towards Explainable Evaluation Metrics for Machine Translation. Christoph Leiter, Piyawat Lertvittayakumjorn, M. Fomicheva, Wei-Ye Zhao, Yang Gao, Steffen Eger. 22 Jun 2023. [ELM]
- Large Language Models are not Fair Evaluators. Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, Zhifang Sui. 29 May 2023. [ALM]
- Do GPTs Produce Less Literal Translations? Vikas Raunak, Arul Menezes, Matt Post, H. Awadallah. 26 May 2023.
- Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks. Xiao Pu, Mingqi Gao, Xiaojun Wan. 24 May 2023. [ELM]
- APPLS: Evaluating Evaluation Metrics for Plain Language Summarization. Yue Guo, Tal August, Gondy Leroy, T. Cohen, Lucy Lu Wang. 23 May 2023.
- Learning Human-Human Interactions in Images from Weak Textual Supervision. Morris Alper, Hadar Averbuch-Elor. 27 Apr 2023. [VLM]
- DeltaScore: Fine-Grained Story Evaluation with Perturbations. Zhuohan Xie, Miao Li, Trevor Cohn, Jey Han Lau. 15 Mar 2023.
- Complex QA and language models hybrid architectures, Survey. Xavier Daull, P. Bellot, Emmanuel Bruno, Vincent Martin, Elisabeth Murisasco. 17 Feb 2023. [ELM]
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis. Wenda Xu, Yi-Lin Tuan, Yujie Lu, Michael Stephen Saxon, Lei Li, William Yang Wang. 10 Oct 2022.
- Layer or Representation Space: What makes BERT-based Evaluation Metrics Robust? Doan Nam Long Vu, N. Moosavi, Steffen Eger. 06 Sep 2022.
- MENLI: Robust Evaluation Metrics from Natural Language Inference. Yanran Chen, Steffen Eger. 15 Aug 2022.
- On the Usefulness of Embeddings, Clusters and Strings for Text Generator Evaluation. Tiago Pimentel, Clara Meister, Ryan Cotterell. 31 May 2022.
- Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand. Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Lavinia Dunagan, Jacob Morrison, Alexander R. Fabbri, Yejin Choi, Noah A. Smith. 08 Dec 2021.
- Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics. Artidoro Pagnoni, Vidhisha Balachandran, Yulia Tsvetkov. 27 Apr 2021. [HILM]
- Probing Classifiers: Promises, Shortcomings, and Advances. Yonatan Belinkov. 24 Feb 2021.
- The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics. Sebastian Gehrmann, Tosin P. Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, ..., Nishant Subramani, Wei-ping Xu, Diyi Yang, Akhila Yerukola, Jiawei Zhou. 02 Feb 2021. [VLM]
- Robustness Gym: Unifying the NLP Evaluation Landscape. Karan Goel, Nazneen Rajani, Jesse Vig, Samson Tan, Jason M. Wu, Stephan Zheng, Caiming Xiong, Mohit Bansal, Christopher Ré. 13 Jan 2021. [AAML, OffRL, OOD]
- On-the-Fly Attention Modulation for Neural Generation. Yue Dong, Chandra Bhagavatula, Ximing Lu, Jena D. Hwang, Antoine Bosselut, Jackie C.K. Cheung, Yejin Choi. 02 Jan 2021.
- Out of Order: How Important Is The Sequential Order of Words in a Sentence in Natural Language Understanding Tasks? Thang M. Pham, Trung Bui, Long Mai, Anh Totti Nguyen. 30 Dec 2020.
- GO FIGURE: A Meta Evaluation of Factuality in Summarization. Saadia Gabriel, Asli Celikyilmaz, Rahul Jha, Yejin Choi, Jianfeng Gao. 24 Oct 2020. [HILM]
- Teaching Machines to Read and Comprehend. Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, L. Espeholt, W. Kay, Mustafa Suleyman, Phil Blunsom. 10 Jun 2015.