arXiv: 2406.12624
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
18 June 2024
Aman Singh Thakur
Kartik Choudhary
Venkat Srinik Ramayapally
Sankaran Vaidyanathan
Dieuwke Hupkes
ELM
ALM
Papers citing "Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges"
50 / 104 papers shown
Automatic Failure Attribution and Critical Step Prediction Method for Multi-Agent Systems Based on Causal Inference
Guoqing Ma
Jia Zhu
Hanghui Guo
Weijie Shi
Jiawei Shen
Jingjiang Liu
Yidan Liang
10 Sep 2025
MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation
Xian Gao
Jiacheng Ruan
Zongyun Zhang
Jingsheng Gao
Ting Liu
Yuzhuo Fu
VLM
19 Aug 2025
Bridging Human and LLM Judgments: Understanding and Narrowing the Gap
Felipe Maia Polo
Xinhe Wang
Mikhail Yurochkin
Gongjun Xu
Moulinath Banerjee
Yuekai Sun
ALM
ELM
18 Aug 2025
Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge
Francesco Fabbri
Gustavo Penha
Edoardo D'Amico
Alice Wang
Marco De Nadai
Jackie Doremus
Paul Gigioli
Andreas Damianou
Oskar Stal
M. Lalmas
OffRL
12 Aug 2025
Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge
Yunna Cai
Fan Wang
Haowei Wang
Kun Wang
Kailai Yang
Sophia Ananiadou
Moyan Li
Mingming Fan
ELM
11 Aug 2025
Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering
Louie Hong Yao
Nicholas Jarvis
Tianyu Jiang
07 Aug 2025
The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
Denis Janiak
Jakub Binkowski
Albert Sawczyn
Bogdan Gabrys
Ravid Schwartz-Ziv
Tomasz Kajdanowicz
HILM
01 Aug 2025
Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges
Yuqi Tang
Kehua Feng
Yunfeng Wang
Zhiwen Chen
Chengfei Lv
Gang Yu
Qiang Zhang
Keyan Ding
ELM
01 Aug 2025
Evaluating Scoring Bias in LLM-as-a-Judge
Qingquan Li
Shaoyu Dou
Kailai Shao
Chao Chen
Haixiang Hu
ELM
27 Jun 2025
Revealing Political Bias in LLMs through Structured Multi-Agent Debate
Aishwarya Bandaru
Fabian Bindley
Trevor Bluth
Nandini Chavda
Baixu Chen
Ethan Law
LLMAG
13 Jun 2025
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
Polina Kirichenko
Mark Ibrahim
Kamalika Chaudhuri
Samuel J. Bell
LRM
10 Jun 2025
Cost-Optimal Active AI Model Evaluation
Anastasios Nikolas Angelopoulos
Jacob Eisenstein
Jonathan Berant
Alekh Agarwal
Adam Fisch
09 Jun 2025
Quantitative LLM Judges
Aishwarya Sahoo
Jeevana Kruthi Karnuthala
Tushar Parmanand Budhwani
Pranchal Agarwal
Sankaran Vaidyanathan
...
Jennifer Healey
Nedim Lipka
Ryan Rossi
Uttaran Bhattacharya
Branislav Kveton
ELM
03 Jun 2025
Dissecting Physics Reasoning in Small Language Models: A Multi-Dimensional Analysis from an Educational Perspective
Nicy Scaria
Silvester John Joseph Kennedy
Diksha Seth
Deepak N. Subramani
LRM
27 May 2025
MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models
Xiaomin Li
Mingye Gao
Yuexing Hao
Taoran Li
Guangya Wan
Zihan Wang
Yijun Wang
LM&MA
ELM
AI4MH
16 May 2025
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
Chalamalasetti Kranti
Sherzod Hakimov
David Schlangen
LLMAG
08 May 2025
MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks
Mouath Abu Daoud
Chaimae Abouzahir
Leen Kharouf
Walid Al-Eisawi
Nizar Habash
Farah E. Shamout
LM&MA
06 May 2025
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Shaokun Zhang
Ming Yin
Jieyu Zhang
Jing Liu
Zhiguang Han
...
Beibin Li
Chi Wang
Hongru Wang
Yuxiao Chen
Qingyun Wu
30 Apr 2025
Anyprefer: An Agentic Framework for Preference Data Synthesis
Yiyang Zhou
Zhaoxiang Wang
Tianle Wang
Shangyu Xing
Peng Xia
...
Chetan Bansal
Weitong Zhang
Ying Wei
Joey Tianyi Zhou
Huaxiu Yao
27 Apr 2025
Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments
Yuante Li
Jama Hussein Mohamud
Chongren Sun
Di Wu
Benoit Boulet
LLMAG
ELM
23 Apr 2025
Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy
Regan Bolton
Mohammadreza Sheikhfathollahi
Simon Parkinson
Dan Basher
Howard Parkinson
18 Apr 2025
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho
Jiahao Huang
Florian Boudin
Akiko Aizawa
ELM
16 Apr 2025
MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages
Dieuwke Hupkes
Nikolay Bogoychev
14 Apr 2025
Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications
Hongliu Cao
Ilias Driouich
Robin Singh
Eoin Thomas
ELM
01 Apr 2025
KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language
Yoonshik Kim
Jaeyoon Jung
31 Mar 2025
Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts
Hongyu Chen
Seraphina Goldfarb-Tarrant
12 Mar 2025
DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering
Sher Badshah
Hassan Sajjad
11 Mar 2025
Learning and generalization of robotic dual-arm manipulation of boxes from demonstrations via Gaussian Mixture Models (GMMs)
Qian Ying Lee
Suhas Raghavendra Kulkarni
Kenzhi Iskandar Wong
Lin Yang
Bernardo Noronha
Yongjun Wee
Tzu-Yi Hung
Domenico Campolo
07 Mar 2025
Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance
Bryan Etzine
Masoud Hashemi
Nishanth Madhusudhan
Sagar Davasam
Roshnee Sharma
Sathwik Tejaswi Madhusudhan
Vikas Yadav
07 Mar 2025
Validating LLM-as-a-Judge Systems under Rating Indeterminacy
Luke M. Guerdan
Solon Barocas
Kenneth Holstein
Hanna M. Wallach
Zhiwei Steven Wu
Alexandra Chouldechova
ALM
ELM
07 Mar 2025
LLMs Can Generate a Better Answer by Aggregating Their Own Responses
Zichong Li
Xinyu Feng
Yuheng Cai
Zixuan Zhang
Tianyi Liu
Chen Liang
Weizhu Chen
Haoyu Wang
Tiejun Zhao
LRM
06 Mar 2025
SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection
Yi-Fan Lu
Xian-Ling Mao
Tian Lan
Tong Zhang
Yu-Shi Zhu
Heyan Huang
05 Mar 2025
NUTSHELL: A Dataset for Abstract Generation from Scientific Talks
Maike Züfle
Sara Papi
Beatrice Savoldi
Marco Gaido
L. Bentivogli
Jan Niehues
24 Feb 2025
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
Rylan Schaeffer
Punit Singh Koura
Binh Tang
R. Subramanian
Aaditya K. Singh
...
Vedanuj Goswami
Sergey Edunov
Dieuwke Hupkes
Sanmi Koyejo
Sharan Narang
ALM
24 Feb 2025
Proactive Privacy Amnesia for Large Language Models: Safeguarding PII with Negligible Impact on Model Utility
Martin Kuo
Jingyang Zhang
Jianyi Zhang
Minxue Tang
Louis DiValentin
...
William Chen
Amin Hass
Tianlong Chen
Yuxiao Chen
Haoyang Li
MU
KELM
24 Feb 2025
Multi-Attribute Steering of Language Models via Targeted Intervention
Duy Nguyen
Archiki Prasad
Elias Stengel-Eskin
Joey Tianyi Zhou
LLMSV
18 Feb 2025
Towards Reasoning Ability of Small Language Models
Gaurav Srivastava
Shuxiang Cao
Xuan Wang
ReLM
LRM
17 Feb 2025
An Empirical Analysis of Uncertainty in Large Language Model Evaluations
Qiujie Xie
Qingqiu Li
Zhuohao Yu
Yuejie Zhang
Yue Zhang
Linyi Yang
ELM
15 Feb 2025
Combining Large Language Models with Static Analyzers for Code Review Generation
Imen Jaoua
Oussama Ben Sghaier
Houari Sahraoui
10 Feb 2025
Preference Leakage: A Contamination Problem in LLM-as-a-judge
Dawei Li
Renliang Sun
Yue Huang
Ming Zhong
Bohan Jiang
Jiawei Han
Wei Wei
Wei Wang
Huan Liu
03 Feb 2025
LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient
Peiwen Yuan
Shaoxiong Feng
Yiwei Li
Xiaobei Wang
Y. Zhang
Jiayi Shi
Chuyi Tan
Boyuan Pan
Yao Hu
Kan Li
02 Feb 2025
From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap
Gopi Krishnan Rajbahadur
G. Oliva
Dayi Lin
Ahmed E. Hassan
28 Jan 2025
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators
Yinhong Liu
Han Zhou
Zhijiang Guo
Ehsan Shareghi
Ivan Vulić
Anna Korhonen
Nigel Collier
ALM
20 Jan 2025
WarriorCoder: Learning from Expert Battles to Augment Code Large Language Models
Huawen Feng
Pu Zhao
Qingfeng Sun
Can Xu
Fangkai Yang
...
Qianli Ma
Qingwei Lin
Saravan Rajmohan
Dongmei Zhang
Qi Zhang
AAML
ALM
23 Dec 2024
KARRIEREWEGE: A Large Scale Career Path Prediction Dataset
Elena Senger
Yuri Campbell
Rob van der Goot
Barbara Plank
AI4TS
19 Dec 2024
Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation
Zhuohao Yu
Weizheng Gu
Yidong Wang
Xingru Jiang
Zhengran Zeng
Jindong Wang
Wei Ye
Shikun Zhang
LRM
19 Dec 2024
JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera
Odellia Boni
Yotam Perlitz
Roy Bar-Haim
Lilach Eden
Asaf Yehudai
ALM
ELM
12 Dec 2024
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM
AILaw
25 Nov 2024
Do LLMs Agree on the Creativity Evaluation of Alternative Uses?
Abdullah Al Rabeyah
Fabrício Góes
Marco Volpe
Talles Medeiros
23 Nov 2024
Bayesian Calibration of Win Rate Estimation with LLM Evaluators
Yicheng Gao
G. Xu
Zhe Wang
Arman Cohan
07 Nov 2024