ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2311.09766
  4. Cited By
LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores

LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores

16 November 2023
Yiqi Liu
N. Moosavi
Chenghua Lin
    ELM
ArXivPDFHTML

Papers citing "LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores"

39 / 39 papers shown
Title
Frame In, Frame Out: Do LLMs Generate More Biased News Headlines than Humans?
Frame In, Frame Out: Do LLMs Generate More Biased News Headlines than Humans?
Valeria Pastorino
N. Moosavi
36
0
0
08 May 2025
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
Laura Dietz
Oleg Zendel
P. Bailey
Charles L. A. Clarke
Ellese Cotterill
Jeff Dalton
Faegheh Hasibi
Mark Sanderson
Nick Craswell
ELM
43
0
0
27 Apr 2025
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho
Jiahao Huang
Florian Boudin
Akiko Aizawa
ELM
29
0
0
16 Apr 2025
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Haozhan Shen
Peng Liu
J. Li
Chunxin Fang
Yibo Ma
...
Zilun Zhang
Kangjia Zhao
Qianqian Zhang
Ruochen Xu
Tiancheng Zhao
VLM
LRM
71
0
0
10 Apr 2025
Do LLM Evaluators Prefer Themselves for a Reason?
Do LLM Evaluators Prefer Themselves for a Reason?
Wei-Lin Chen
Zhepei Wei
Xinyu Zhu
Shi Feng
Yu Meng
ELM
LRM
42
0
0
04 Apr 2025
Group Preference Alignment: Customized LLM Response Generation from In-Situ Conversations
Ishani Mondal
Jack W. Stokes
S. Jauhar
Longqi Yang
Mengting Wan
Xiaofeng Xu
Xia Song
Jennifer Neville
46
0
0
11 Mar 2025
GRP: Goal-Reversed Prompting for Zero-Shot Evaluation with LLMs
Mingyang Song
Mao Zheng
Xuan Luo
LRM
58
0
0
08 Mar 2025
Prompting a Weighting Mechanism into LLM-as-a-Judge in Two-Step: A Case Study
Prompting a Weighting Mechanism into LLM-as-a-Judge in Two-Step: A Case Study
Wenwen Xie
Gray Gwizdz
Dongji Feng
75
0
0
20 Feb 2025
Cascaded Self-Evaluation Augmented Training for Lightweight Multimodal LLMs
Cascaded Self-Evaluation Augmented Training for Lightweight Multimodal LLMs
Zheqi Lv
Wenkai Wang
Jiawei Wang
Shengyu Zhang
Fei Wu
LRM
ReLM
51
0
0
10 Jan 2025
The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input
Alon Jacovi
Andrew Wang
Chris Alberti
Connie Tao
Jon Lipovetz
...
Rachana Fellinger
Rui Wang
Zizhao Zhang
Sasha Goldshtein
Dipanjan Das
HILM
ALM
77
11
0
06 Jan 2025
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM
AILaw
108
61
0
25 Nov 2024
Writing Style Matters: An Examination of Bias and Fairness in
  Information Retrieval Systems
Writing Style Matters: An Examination of Bias and Fairness in Information Retrieval Systems
Hongliu Cao
64
2
0
20 Nov 2024
AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions
  for Conversational Search with LLMs
AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs
Clemencia Siro
Yifei Yuan
Mohammad Aliannejadi
Maarten de Rijke
ELM
18
2
0
25 Oct 2024
How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs
How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs
Ran Zhang
Wei-Ye Zhao
Steffen Eger
68
4
0
24 Oct 2024
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Florian E. Dorner
Vivian Y. Nastl
Moritz Hardt
ELM
ALM
33
5
0
17 Oct 2024
A Critical Look at Meta-evaluating Summarisation Evaluation Metrics
A Critical Look at Meta-evaluating Summarisation Evaluation Metrics
Xiang Dai
Sarvnaz Karimi
Biaoyan Fang
17
0
0
29 Sep 2024
Questioning Internal Knowledge Structure of Large Language Models
  Through the Lens of the Olympic Games
Questioning Internal Knowledge Structure of Large Language Models Through the Lens of the Olympic Games
Juhwan Choi
Youngbin Kim
38
0
0
10 Sep 2024
From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
Andreas Stephan
D. Zhu
Matthias Aßenmacher
Xiaoyu Shen
Benjamin Roth
ELM
45
4
0
06 Sep 2024
ConsistencyTrack: A Robust Multi-Object Tracker with a Generation
  Strategy of Consistency Model
ConsistencyTrack: A Robust Multi-Object Tracker with a Generation Strategy of Consistency Model
Lifan Jiang
Zhihui Wang
Siqi Yin
Guangxiao Ma
Peng Zhang
Boxi Wu
DiffM
51
0
0
28 Aug 2024
Self-Recognition in Language Models
Self-Recognition in Language Models
Tim R. Davidson
Viacheslav Surkov
V. Veselovsky
Giuseppe Russo
Robert West
Çağlar Gülçehre
PILM
224
2
0
09 Jul 2024
Spontaneous Reward Hacking in Iterative Self-Refinement
Spontaneous Reward Hacking in Iterative Self-Refinement
Jane Pan
He He
Samuel R. Bowman
Shi Feng
27
10
0
05 Jul 2024
TREC iKAT 2023: A Test Collection for Evaluating Conversational and
  Interactive Knowledge Assistants
TREC iKAT 2023: A Test Collection for Evaluating Conversational and Interactive Knowledge Assistants
Mohammad Aliannejadi
Zahra Abbasiantaeb
Shubham Chatterjee
Jeffery Dalton
Leif Azzopardi
28
9
0
04 May 2024
Large Language Models are Inconsistent and Biased Evaluators
Large Language Models are Inconsistent and Biased Evaluators
Rickard Stureborg
Dimitris Alikaniotis
Yoshi Suhara
ALM
18
50
0
02 May 2024
ReproHum #0087-01: Human Evaluation Reproduction Report for Generating
  Fact Checking Explanations
ReproHum #0087-01: Human Evaluation Reproduction Report for Generating Fact Checking Explanations
Tyler Loakman
Chenghua Lin
24
0
0
26 Apr 2024
Unifying Bias and Unfairness in Information Retrieval: A Survey of
  Challenges and Opportunities with Large Language Models
Unifying Bias and Unfairness in Information Retrieval: A Survey of Challenges and Opportunities with Large Language Models
Sunhao Dai
Chen Xu
Shicheng Xu
Liang Pang
Zhenhua Dong
Jun Xu
42
60
0
17 Apr 2024
LLM Evaluators Recognize and Favor Their Own Generations
LLM Evaluators Recognize and Favor Their Own Generations
Arjun Panickssery
Samuel R. Bowman
Shi Feng
36
152
0
15 Apr 2024
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on
  Zero-shot LLM Assessment
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
Vyas Raina
Adian Liusie
Mark J. F. Gales
AAML
ELM
21
51
0
21 Feb 2024
MONAL: Model Autophagy Analysis for Modeling Human-AI Interactions
MONAL: Model Autophagy Analysis for Modeling Human-AI Interactions
Shu Yang
Muhammad Asif Ali
Lu Yu
Lijie Hu
Di Wang
LLMAG
16
2
0
17 Feb 2024
Humans or LLMs as the Judge? A Study on Judgement Biases
Humans or LLMs as the Judge? A Study on Judgement Biases
Guiming Hardy Chen
Shunian Chen
Ziche Liu
Feng Jiang
Benyou Wang
74
89
0
16 Feb 2024
Leveraging Large Language Models for NLG Evaluation: Advances and
  Challenges
Leveraging Large Language Models for NLG Evaluation: Advances and Challenges
Zhen Li
Xiaohan Xu
Tao Shen
Can Xu
Jia-Chen Gu
Yuxuan Lai
Chongyang Tao
Shuai Ma
LM&MA
ELM
26
9
0
13 Jan 2024
Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies
Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies
Tom Kocmi
Vilém Zouhar
C. Federmann
Matt Post
21
26
0
12 Jan 2024
LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms
LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms
Aditi Jha
Sam Havens
Jeremey Dohmann
Alex Trott
Jacob P. Portes
ALM
14
10
0
22 Nov 2023
Safer-Instruct: Aligning Language Models with Automated Preference Data
Safer-Instruct: Aligning Language Models with Automated Preference Data
Taiwei Shi
Kai Chen
Jieyu Zhao
ALM
SyDa
11
20
0
15 Nov 2023
How Far are We from Robust Long Abstractive Summarization?
How Far are We from Robust Long Abstractive Summarization?
Huan Yee Koh
Jiaxin Ju
He Zhang
Ming Liu
Shirui Pan
HILM
23
39
0
30 Oct 2022
On the Limitations of Reference-Free Evaluations of Generated Text
On the Limitations of Reference-Free Evaluations of Generated Text
Daniel Deutsch
Rotem Dror
Dan Roth
27
44
0
22 Oct 2022
Layer or Representation Space: What makes BERT-based Evaluation Metrics
  Robust?
Layer or Representation Space: What makes BERT-based Evaluation Metrics Robust?
Doan Nam Long Vu
N. Moosavi
Steffen Eger
9
9
0
06 Sep 2022
Perturbation CheckLists for Evaluating NLG Evaluation Metrics
Perturbation CheckLists for Evaluating NLG Evaluation Metrics
Ananya B. Sai
Tanay Dixit
D. Y. Sheth
S. Mohan
Mitesh M. Khapra
AAML
97
55
0
13 Sep 2021
The GEM Benchmark: Natural Language Generation, its Evaluation and
  Metrics
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann
Tosin P. Adewumi
Karmanya Aggarwal
Pawan Sasanka Ammanamanchi
Aremu Anuoluwapo
...
Nishant Subramani
Wei-ping Xu
Diyi Yang
Akhila Yerukola
Jiawei Zhou
VLM
238
284
0
02 Feb 2021
Teaching Machines to Read and Comprehend
Teaching Machines to Read and Comprehend
Karl Moritz Hermann
Tomás Kociský
Edward Grefenstette
L. Espeholt
W. Kay
Mustafa Suleyman
Phil Blunsom
170
3,504
0
10 Jun 2015
1