arXiv:2311.09766
Cited By
LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores
16 November 2023
Yiqi Liu, N. Moosavi, Chenghua Lin
ELM
Papers citing "LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores" (39 / 39 papers shown)
Frame In, Frame Out: Do LLMs Generate More Biased News Headlines than Humans?
Valeria Pastorino, N. Moosavi · 36 / 0 / 0 · 08 May 2025

LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
Laura Dietz, Oleg Zendel, P. Bailey, Charles L. A. Clarke, Ellese Cotterill, Jeff Dalton, Faegheh Hasibi, Mark Sanderson, Nick Craswell · ELM · 43 / 0 / 0 · 27 Apr 2025

LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho, Jiahao Huang, Florian Boudin, Akiko Aizawa · ELM · 29 / 0 / 0 · 16 Apr 2025

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Haozhan Shen, Peng Liu, J. Li, Chunxin Fang, Yibo Ma, ..., Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, Tiancheng Zhao · VLM, LRM · 71 / 0 / 0 · 10 Apr 2025

Do LLM Evaluators Prefer Themselves for a Reason?
Wei-Lin Chen, Zhepei Wei, Xinyu Zhu, Shi Feng, Yu Meng · ELM, LRM · 42 / 0 / 0 · 04 Apr 2025

Group Preference Alignment: Customized LLM Response Generation from In-Situ Conversations
Ishani Mondal, Jack W. Stokes, S. Jauhar, Longqi Yang, Mengting Wan, Xiaofeng Xu, Xia Song, Jennifer Neville · 46 / 0 / 0 · 11 Mar 2025

GRP: Goal-Reversed Prompting for Zero-Shot Evaluation with LLMs
Mingyang Song, Mao Zheng, Xuan Luo · LRM · 58 / 0 / 0 · 08 Mar 2025

Prompting a Weighting Mechanism into LLM-as-a-Judge in Two-Step: A Case Study
Wenwen Xie, Gray Gwizdz, Dongji Feng · 75 / 0 / 0 · 20 Feb 2025

Cascaded Self-Evaluation Augmented Training for Lightweight Multimodal LLMs
Zheqi Lv, Wenkai Wang, Jiawei Wang, Shengyu Zhang, Fei Wu · LRM, ReLM · 51 / 0 / 0 · 10 Jan 2025

The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input
Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, ..., Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshtein, Dipanjan Das · HILM, ALM · 77 / 11 / 0 · 06 Jan 2025

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, ..., Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, Huan Liu · ELM, AILaw · 108 / 61 / 0 · 25 Nov 2024

Writing Style Matters: An Examination of Bias and Fairness in Information Retrieval Systems
Hongliu Cao · 64 / 2 / 0 · 20 Nov 2024

AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs
Clemencia Siro, Yifei Yuan, Mohammad Aliannejadi, Maarten de Rijke · ELM · 18 / 2 / 0 · 25 Oct 2024

How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs
Ran Zhang, Wei-Ye Zhao, Steffen Eger · 68 / 4 / 0 · 24 Oct 2024

Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Florian E. Dorner, Vivian Y. Nastl, Moritz Hardt · ELM, ALM · 33 / 5 / 0 · 17 Oct 2024

A Critical Look at Meta-evaluating Summarisation Evaluation Metrics
Xiang Dai, Sarvnaz Karimi, Biaoyan Fang · 17 / 0 / 0 · 29 Sep 2024

Questioning Internal Knowledge Structure of Large Language Models Through the Lens of the Olympic Games
Juhwan Choi, Youngbin Kim · 38 / 0 / 0 · 10 Sep 2024

From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
Andreas Stephan, D. Zhu, Matthias Aßenmacher, Xiaoyu Shen, Benjamin Roth · ELM · 45 / 4 / 0 · 06 Sep 2024

ConsistencyTrack: A Robust Multi-Object Tracker with a Generation Strategy of Consistency Model
Lifan Jiang, Zhihui Wang, Siqi Yin, Guangxiao Ma, Peng Zhang, Boxi Wu · DiffM · 51 / 0 / 0 · 28 Aug 2024

Self-Recognition in Language Models
Tim R. Davidson, Viacheslav Surkov, V. Veselovsky, Giuseppe Russo, Robert West, Çağlar Gülçehre · PILM · 224 / 2 / 0 · 09 Jul 2024

Spontaneous Reward Hacking in Iterative Self-Refinement
Jane Pan, He He, Samuel R. Bowman, Shi Feng · 27 / 10 / 0 · 05 Jul 2024

TREC iKAT 2023: A Test Collection for Evaluating Conversational and Interactive Knowledge Assistants
Mohammad Aliannejadi, Zahra Abbasiantaeb, Shubham Chatterjee, Jeffery Dalton, Leif Azzopardi · 28 / 9 / 0 · 04 May 2024

Large Language Models are Inconsistent and Biased Evaluators
Rickard Stureborg, Dimitris Alikaniotis, Yoshi Suhara · ALM · 18 / 50 / 0 · 02 May 2024

ReproHum #0087-01: Human Evaluation Reproduction Report for Generating Fact Checking Explanations
Tyler Loakman, Chenghua Lin · 24 / 0 / 0 · 26 Apr 2024

Unifying Bias and Unfairness in Information Retrieval: A Survey of Challenges and Opportunities with Large Language Models
Sunhao Dai, Chen Xu, Shicheng Xu, Liang Pang, Zhenhua Dong, Jun Xu · 42 / 60 / 0 · 17 Apr 2024

LLM Evaluators Recognize and Favor Their Own Generations
Arjun Panickssery, Samuel R. Bowman, Shi Feng · 36 / 152 / 0 · 15 Apr 2024

Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
Vyas Raina, Adian Liusie, Mark J. F. Gales · AAML, ELM · 21 / 51 / 0 · 21 Feb 2024

MONAL: Model Autophagy Analysis for Modeling Human-AI Interactions
Shu Yang, Muhammad Asif Ali, Lu Yu, Lijie Hu, Di Wang · LLMAG · 16 / 2 / 0 · 17 Feb 2024

Humans or LLMs as the Judge? A Study on Judgement Biases
Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang · 74 / 89 / 0 · 16 Feb 2024

Leveraging Large Language Models for NLG Evaluation: Advances and Challenges
Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, Yuxuan Lai, Chongyang Tao, Shuai Ma · LM&MA, ELM · 26 / 9 / 0 · 13 Jan 2024

Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies
Tom Kocmi, Vilém Zouhar, C. Federmann, Matt Post · 21 / 26 / 0 · 12 Jan 2024

LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms
Aditi Jha, Sam Havens, Jeremey Dohmann, Alex Trott, Jacob P. Portes · ALM · 14 / 10 / 0 · 22 Nov 2023

Safer-Instruct: Aligning Language Models with Automated Preference Data
Taiwei Shi, Kai Chen, Jieyu Zhao · ALM, SyDa · 11 / 20 / 0 · 15 Nov 2023

How Far are We from Robust Long Abstractive Summarization?
Huan Yee Koh, Jiaxin Ju, He Zhang, Ming Liu, Shirui Pan · HILM · 23 / 39 / 0 · 30 Oct 2022

On the Limitations of Reference-Free Evaluations of Generated Text
Daniel Deutsch, Rotem Dror, Dan Roth · 27 / 44 / 0 · 22 Oct 2022

Layer or Representation Space: What makes BERT-based Evaluation Metrics Robust?
Doan Nam Long Vu, N. Moosavi, Steffen Eger · 9 / 9 / 0 · 06 Sep 2022

Perturbation CheckLists for Evaluating NLG Evaluation Metrics
Ananya B. Sai, Tanay Dixit, D. Y. Sheth, S. Mohan, Mitesh M. Khapra · AAML · 97 / 55 / 0 · 13 Sep 2021

The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann, Tosin P. Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, ..., Nishant Subramani, Wei-ping Xu, Diyi Yang, Akhila Yerukola, Jiawei Zhou · VLM · 238 / 284 / 0 · 02 Feb 2021

Teaching Machines to Read and Comprehend
Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, L. Espeholt, W. Kay, Mustafa Suleyman, Phil Blunsom · 170 / 3,504 / 0 · 10 Jun 2015