Preference Leakage: A Contamination Problem in LLM-as-a-judge


3 February 2025
Dawei Li
Renliang Sun
Yue Huang
Ming Zhong
Bohan Jiang
Jiawei Han
Wei Wei
Wei Wang
Huan Liu
ArXiv (abs) | PDF | HTML | HuggingFace (41 upvotes)

Papers citing "Preference Leakage: A Contamination Problem in LLM-as-a-judge"

50 / 116 papers shown
Title
Do LLM-judges Align with Human Relevance in Cranfield-style Recommender Evaluation?
Gustavo Penha
Aleksandr V. Petrov
C. Hauff
Enrico Palumbo
Ali Vardasbi
...
Alice Wang
Praveen Chandar
Henrik Lindström
Hugues Bouchard
M. Lalmas
36
0
0
28 Nov 2025
LungNoduleAgent: A Collaborative Multi-Agent System for Precision Diagnosis of Lung Nodules
Cheng Yang
Hui Jin
Xinlei Yu
Zhipeng Wang
Y. Liu
Fenglei Fan
Dajiang Lei
Gangyong Jia
Changmiao Wang
Ruiquan Ge
69
0
0
26 Nov 2025
Beyond Task-Oriented and Chitchat Dialogues: Proactive and Transition-Aware Conversational Agents
Yejin Yoon
Yuri Son
Namyoung So
Minseo Kim
Minsoo Cho
Chanhee Park
Seungshin Lee
Taeuk Kim
LLMAG
102
0
0
11 Nov 2025
ReMind: Understanding Deductive Code Reasoning in LLMs
Jun Gao
Yun Peng
Xiaoxue Ren
LRM
117
0
0
01 Nov 2025
MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models
Zixin Chen
Hongzhan Lin
Kaixin Li
Ziyang Luo
Yayue Deng
Jing Ma
106
0
0
31 Oct 2025
Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media
Shakib Yazdani
Yasser Hamidullah
C. España-Bonet
Josef van Genabith
SLR
206
1
0
29 Oct 2025
Approximating Human Preferences Using a Multi-Judge Learned System
Eitán Sprejer
Fernando Avalos
Augusto Bernardi
Jose Pedro Brito de Azevedo Faustino
Jacob Haimes
Narmeen Fatimah Oozeer
48
0
0
29 Oct 2025
mmWalk: Towards Multi-modal Multi-view Walking Assistance
Kedi Ying
R. Liu
Chongyan Chen
Mingzhe Tao
Hao-miao Shi
Kailun Yang
Jiaming Zhang
Rainer Stiefelhagen
107
0
0
13 Oct 2025
Who's Your Judge? On the Detectability of LLM-Generated Judgments
Dawei Li
Zhen Tan
Chengshuai Zhao
Bohan Jiang
Baixiang Huang
Pingchuan Ma
Abdullah Alnaibari
Kai Shu
Huan Liu
161
0
0
29 Sep 2025
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
Yidong Wang
Yunze Song
Tingyuan Zhu
X. Zhang
Zhuohao Yu
...
Zhen Wu
Xinyu Dai
Yue Zhang
Wei Ye
Shikun Zhang
ALM
214
0
0
25 Sep 2025
Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment
Ankur Samanta
Akshayaa Magesh
Youliang Yu
Runzhe Wu
Ayush Jain
Daniel Jiang
Boris Vidolov
Paul Sajda
Yonathan Efroni
Kaveh Hassani
LLMAG, LRM
269
0
0
18 Sep 2025
Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Khaoula Chehbouni
Mohammed Haddou
Jackie CK Cheung
G. Farnadi
LLMAG
313
6
0
25 Aug 2025
What Matters in Data for DPO?
Yu Pan
Zhongze Cai
Guanting Chen
Huaiyang Zhong
Chonghuan Wang
236
3
0
23 Aug 2025
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
Ziyang Luo
Zhiqi Shen
Wenzhuo Yang
Zirui Zhao
Prathyusha Jwalapuram
Amrita Saha
Doyen Sahoo
Silvio Savarese
Caiming Xiong
Junnan Li
ELM
172
21
0
20 Aug 2025
LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text
MohamamdJavad Ardestani
Ehsan Kamalloo
Davood Rafiei
92
1
0
20 Aug 2025
Lexical Hints of Accuracy in LLM Reasoning Chains
Arne Vanhoyweghen
Brecht Verbeken
Andres Algaba
Vincent Ginis
117
1
0
19 Aug 2025
Are Today's LLMs Ready to Explain Well-Being Concepts?
Bohan Jiang
Dawei Li
Zhen Tan
Chengshuai Zhao
Huan Liu
AI4MH
157
0
0
06 Aug 2025
Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations
Peng Lai
Jianjie Zheng
Sijie Cheng
Yun-Nung Chen
Peng Li
Yang Liu
Guanhua Chen
171
1
0
05 Aug 2025
Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?
Chaymaa Abbas
Mariette Awad
Razane Tajeddine
81
0
0
25 Jul 2025
DCR: Quantifying Data Contamination in LLMs Evaluation
Cheng Xu
Nan Yan
Shuhao Guan
Changhong Jin
Yuke Mei
Yibing Guo
Mohand-Tahar Kechadi
153
1
0
15 Jul 2025
One Token to Fool LLM-as-a-Judge
Yulai Zhao
Haolin Liu
Dian Yu
Sunyuan Kung
Meijia Chen
Haitao Mi
Dong Yu
OffRL, LRM
205
19
0
11 Jul 2025
Quantitative LLM Judges
Aishwarya Sahoo
Jeevana Kruthi Karnuthala
Tushar Parmanand Budhwani
Pranchal Agarwal
Sankaran Vaidyanathan
...
Jennifer Healey
Nedim Lipka
Ryan Rossi
Uttaran Bhattacharya
Branislav Kveton
ELM
285
1
0
03 Jun 2025
Beyond the Surface: Measuring Self-Preference in LLM Judgments
Zhi-Yuan Chen
Hao Wang
Xinyu Zhang
Enrui Hu
Yankai Lin
153
4
0
03 Jun 2025
ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code
Tianyu Hua
Harper Hua
Robert Z. Sparks
Benjamin Klieger
Sang T. Truong
Weixin Liang
Fan-Yun Sun
Nick Haber
143
4
0
02 Jun 2025
Towards Conversational Development Environments: Using Theory-of-Mind and Multi-Agent Architectures for Requirements Refinement
Keheliya Gallaba
Ali Arabat
Dayi Lin
Mohammed Sayagh
Ahmed E. Hassan
AI4CE
265
1
0
27 May 2025
CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation
Guang Yang
Yu Zhou
Xiang Chen
Wei-Shi Zheng
Xing Hu
Xin Zhou
David Lo
Taolue Chen
ALM, LRM
190
5
0
26 May 2025
DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation
Pingzhi Li
Zhen Tan
Huaizhi Qu
Huan Liu
Tianlong Chen
AAML
230
3
0
26 May 2025
Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge
Chiyu Ma
Enpei Zhang
Yilun Zhao
Wenjun Liu
Yaning Jia
Peijun Qing
Lin Shi
Arman Cohan
Yujun Yan
Soroush Vosoughi
LLMAG, ELM
391
3
0
26 May 2025
The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation
Ruichen Zhang
Rana Muhammad Shahroz Khan
Zhen Tan
Dawei Li
Song Wang
Tianlong Chen
LRM
236
1
0
24 May 2025
Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees
Sangwoo Park
Matteo Zecchin
Osvaldo Simeone
178
2
0
24 May 2025
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary
Licheng Pan
Yongqi Tong
Xin Zhang
Xiaolu Zhang
Jun Zhou
Zhixuan Chu
246
1
0
23 May 2025
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
Kaixuan Fan
Kaituo Feng
Haoming Lyu
Dongzhan Zhou
Xiangyu Yue
ReLM, LRM
286
22
0
22 May 2025
OpenEthics: A Comprehensive Ethical Evaluation of Open-Source Generative Large Language Models
Burak Erinç Çetin
Yıldırım Özen
Elif Naz Demiryılmaz
Kaan Engür
Cagri Toraman
ELM
184
1
0
21 May 2025
YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jennifer D'Souza
Hamed Babaei Giglou
Quentin Münch
ELM
405
5
0
20 May 2025
Truth or Twist? Optimal Model Selection for Reliable Label Flipping Evaluation in LLM-based Counterfactuals
Qianli Wang
Van Bach Nguyen
Nils Feldhus
Luis Felipe Villa-Arenas
Christin Seifert
Sebastian Möller
Vera Schmitt
328
0
0
20 May 2025
DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models
Yuxuan Jiang
Dawei Li
Frank Ferraro
LRM
502
10
0
20 May 2025
Krikri: Advancing Open Large Language Models for Greek
Dimitris Roussis
Leon Voukoutis
Georgios Paraskevopoulos
Sokratis Sofianopoulos
Prokopis Prokopidis
Vassilis Papavasileiou
Athanasios Katsamanis
Stelios Piperidis
Vassilis Katsouros
ALM
345
5
0
19 May 2025
A Survey on Privacy Risks and Protection in Large Language Models
Journal of King Saud University: Computer and Information Sciences (J. King Saud Univ. Comput. Inf. Sci.), 2025
Kang Chen
Xiuze Zhou
Yuanguo Lin
Shibo Feng
Li Shen
Pengcheng Wu
AILaw, PILM
1.1K
15
0
04 May 2025
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
Laura Dietz
Oleg Zendel
P. Bailey
Charles L. A. Clarke
Ellese Cotterill
Jeff Dalton
Faegheh Hasibi
Mark Sanderson
Nick Craswell
ELM
255
9
0
27 Apr 2025
DONOD: Robust and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning
Jucheng Hu
Steve Yang
Dongzhan Zhou
Lijun Wu
190
2
0
21 Apr 2025
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
Yujiao Shi
Tianyi Liang
Xinyue Huang
Erfei Cui
Xu Guo
Pei Chu
Chenhui Li
Ru Zhang
Wenhai Wang
Gongshen Liu
621
4
0
15 Apr 2025
CHARM: Calibrating Reward Models With Chatbot Arena Scores
Xiao Zhu
Chenmien Tan
Pinzhen Chen
Rico Sennrich
Yanlin Zhang
Hanxu Hu
ALM
282
4
0
14 Apr 2025
A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
Jialun Zhong
Wei Shen
Yanzeng Li
Songyang Gao
Hua Lu
Yicheng Chen
Yang Zhang
Wei Zhou
Jinjie Gu
Lei Zou
LRM
324
27
0
12 Apr 2025
Do LLM Evaluators Prefer Themselves for a Reason?
Wei-Lin Chen
Zhepei Wei
Xinyu Zhu
Shi Feng
Yu Meng
ELM, LRM
253
21
0
04 Apr 2025
Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models
Liangjie Huang
Dawei Li
Huan Liu
Lu Cheng
LRM
334
0
0
03 Apr 2025
MAVERIX: Multimodal Audio-Visual Evaluation and Recognition IndeX
Liuyue Xie
George Z. Wei
Avik Kuthiala
Ce Zheng
Ananya Bal
...
Rohan Choudhury
Morteza Ziyadi
Xu Zhang
Hao Yang
László A. Jeni
243
1
0
27 Mar 2025
CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation
Peiding Wang
Lulu Zhang
Fang Liu
Lin Shi
Minxiao Li
Bo Shen
An Fu
ELM, LRM
854
10
0
05 Mar 2025
What do Large Language Models Say About Animals? Investigating Risks of Animal Harm in Generated Text
Conference on Fairness, Accountability and Transparency (FAccT), 2025
Arturs Kanepajs
Aditi Basu
Sankalpa Ghose
Constance Li
Akshat Mehta
Ronak Mehta
Samuel David Tucker-Davis
Eric Zhou
Bob Fischer
Jacy Reese Anthis
ELM, ALM
405
6
0
03 Mar 2025
The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer
Marthe Ballon
Andres Algaba
Vincent Ginis
LRM, ReLM
281
32
0
24 Feb 2025
BPO: Towards Balanced Preference Optimization between Knowledge Breadth and Depth in Alignment
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Sizhe Wang
Yongqi Tong
Hengyuan Zhang
Dawei Li
Xin Zhang
Tianlong Chen
435
14
0
21 Feb 2025