arXiv:2311.09766
Cited By
LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores
16 November 2023
Yiqi Liu, N. Moosavi, Chenghua Lin
ELM
Papers citing "LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores" (39 / 39 papers shown)
Frame In, Frame Out: Do LLMs Generate More Biased News Headlines than Humans?
Valeria Pastorino, N. Moosavi · 36 / 0 / 0 · 08 May 2025

LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
Laura Dietz, Oleg Zendel, P. Bailey, Charles L. A. Clarke, Ellese Cotterill, Jeff Dalton, Faegheh Hasibi, Mark Sanderson, Nick Craswell · ELM · 43 / 0 / 0 · 27 Apr 2025

LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho, Jiahao Huang, Florian Boudin, Akiko Aizawa · ELM · 29 / 0 / 0 · 16 Apr 2025

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Haozhan Shen, Peng Liu, J. Li, Chunxin Fang, Yibo Ma, ..., Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, Tiancheng Zhao · VLM, LRM · 71 / 0 / 0 · 10 Apr 2025

Do LLM Evaluators Prefer Themselves for a Reason?
Wei-Lin Chen, Zhepei Wei, Xinyu Zhu, Shi Feng, Yu Meng · ELM, LRM · 42 / 0 / 0 · 04 Apr 2025

Group Preference Alignment: Customized LLM Response Generation from In-Situ Conversations
Ishani Mondal, Jack W. Stokes, S. Jauhar, Longqi Yang, Mengting Wan, Xiaofeng Xu, Xia Song, Jennifer Neville · 46 / 0 / 0 · 11 Mar 2025

GRP: Goal-Reversed Prompting for Zero-Shot Evaluation with LLMs
Mingyang Song, Mao Zheng, Xuan Luo · LRM · 58 / 0 / 0 · 08 Mar 2025

Prompting a Weighting Mechanism into LLM-as-a-Judge in Two-Step: A Case Study
Wenwen Xie, Gray Gwizdz, Dongji Feng · 75 / 0 / 0 · 20 Feb 2025

Cascaded Self-Evaluation Augmented Training for Lightweight Multimodal LLMs
Zheqi Lv, Wenkai Wang, Jiawei Wang, Shengyu Zhang, Fei Wu · LRM, ReLM · 51 / 0 / 0 · 10 Jan 2025

The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input
Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, ..., Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshtein, Dipanjan Das · HILM, ALM · 77 / 11 / 0 · 06 Jan 2025

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, ..., Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, Huan Liu · ELM, AILaw · 108 / 61 / 0 · 25 Nov 2024

Writing Style Matters: An Examination of Bias and Fairness in Information Retrieval Systems
Hongliu Cao · 64 / 2 / 0 · 20 Nov 2024

AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs
Clemencia Siro, Yifei Yuan, Mohammad Aliannejadi, Maarten de Rijke · ELM · 18 / 2 / 0 · 25 Oct 2024

How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs
Ran Zhang, Wei-Ye Zhao, Steffen Eger · 68 / 4 / 0 · 24 Oct 2024

Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Florian E. Dorner, Vivian Y. Nastl, Moritz Hardt · ELM, ALM · 33 / 5 / 0 · 17 Oct 2024

A Critical Look at Meta-evaluating Summarisation Evaluation Metrics
Xiang Dai, Sarvnaz Karimi, Biaoyan Fang · 17 / 0 / 0 · 29 Sep 2024

Questioning Internal Knowledge Structure of Large Language Models Through the Lens of the Olympic Games
Juhwan Choi, Youngbin Kim · 38 / 0 / 0 · 10 Sep 2024

From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
Andreas Stephan, D. Zhu, Matthias Aßenmacher, Xiaoyu Shen, Benjamin Roth · ELM · 45 / 4 / 0 · 06 Sep 2024

ConsistencyTrack: A Robust Multi-Object Tracker with a Generation Strategy of Consistency Model
Lifan Jiang, Zhihui Wang, Siqi Yin, Guangxiao Ma, Peng Zhang, Boxi Wu · DiffM · 51 / 0 / 0 · 28 Aug 2024

Self-Recognition in Language Models
Tim R. Davidson, Viacheslav Surkov, V. Veselovsky, Giuseppe Russo, Robert West, Çağlar Gülçehre · PILM · 224 / 2 / 0 · 09 Jul 2024

Spontaneous Reward Hacking in Iterative Self-Refinement
Jane Pan, He He, Samuel R. Bowman, Shi Feng · 27 / 10 / 0 · 05 Jul 2024

TREC iKAT 2023: A Test Collection for Evaluating Conversational and Interactive Knowledge Assistants
Mohammad Aliannejadi, Zahra Abbasiantaeb, Shubham Chatterjee, Jeffery Dalton, Leif Azzopardi · 28 / 9 / 0 · 04 May 2024

Large Language Models are Inconsistent and Biased Evaluators
Rickard Stureborg, Dimitris Alikaniotis, Yoshi Suhara · ALM · 18 / 50 / 0 · 02 May 2024

ReproHum #0087-01: Human Evaluation Reproduction Report for Generating Fact Checking Explanations
Tyler Loakman, Chenghua Lin · 24 / 0 / 0 · 26 Apr 2024

Unifying Bias and Unfairness in Information Retrieval: A Survey of Challenges and Opportunities with Large Language Models
Sunhao Dai, Chen Xu, Shicheng Xu, Liang Pang, Zhenhua Dong, Jun Xu · 42 / 60 / 0 · 17 Apr 2024

LLM Evaluators Recognize and Favor Their Own Generations
Arjun Panickssery, Samuel R. Bowman, Shi Feng · 36 / 152 / 0 · 15 Apr 2024

Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
Vyas Raina, Adian Liusie, Mark J. F. Gales · AAML, ELM · 21 / 51 / 0 · 21 Feb 2024

MONAL: Model Autophagy Analysis for Modeling Human-AI Interactions
Shu Yang, Muhammad Asif Ali, Lu Yu, Lijie Hu, Di Wang · LLMAG · 16 / 2 / 0 · 17 Feb 2024

Humans or LLMs as the Judge? A Study on Judgement Biases
Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang · 74 / 89 / 0 · 16 Feb 2024

Leveraging Large Language Models for NLG Evaluation: Advances and Challenges
Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, Yuxuan Lai, Chongyang Tao, Shuai Ma · LM&MA, ELM · 26 / 9 / 0 · 13 Jan 2024

Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies
Tom Kocmi, Vilém Zouhar, C. Federmann, Matt Post · 21 / 26 / 0 · 12 Jan 2024

LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms
Aditi Jha, Sam Havens, Jeremey Dohmann, Alex Trott, Jacob P. Portes · ALM · 14 / 10 / 0 · 22 Nov 2023

Safer-Instruct: Aligning Language Models with Automated Preference Data
Taiwei Shi, Kai Chen, Jieyu Zhao · ALM, SyDa · 11 / 20 / 0 · 15 Nov 2023

How Far are We from Robust Long Abstractive Summarization?
Huan Yee Koh, Jiaxin Ju, He Zhang, Ming Liu, Shirui Pan · HILM · 23 / 39 / 0 · 30 Oct 2022

On the Limitations of Reference-Free Evaluations of Generated Text
Daniel Deutsch, Rotem Dror, Dan Roth · 27 / 44 / 0 · 22 Oct 2022

Layer or Representation Space: What makes BERT-based Evaluation Metrics Robust?
Doan Nam Long Vu, N. Moosavi, Steffen Eger · 9 / 9 / 0 · 06 Sep 2022

Perturbation CheckLists for Evaluating NLG Evaluation Metrics
Ananya B. Sai, Tanay Dixit, D. Y. Sheth, S. Mohan, Mitesh M. Khapra · AAML · 97 / 55 / 0 · 13 Sep 2021

The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann, Tosin P. Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, ..., Nishant Subramani, Wei-ping Xu, Diyi Yang, Akhila Yerukola, Jiawei Zhou · VLM · 238 / 284 / 0 · 02 Feb 2021

Teaching Machines to Read and Comprehend
Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, L. Espeholt, W. Kay, Mustafa Suleyman, Phil Blunsom · 170 / 3,504 / 0 · 10 Jun 2015