ResearchTrend.AI
LLM Evaluators Recognize and Favor Their Own Generations
Arjun Panickssery, Samuel R. Bowman, Shi Feng
arXiv:2404.13076 · 15 April 2024

Papers citing "LLM Evaluators Recognize and Favor Their Own Generations"

50 / 154 papers shown
When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers
Jack Lu, Ryan Teehan, Jinran Jin, Mengye Ren
LRM · 02 Dec 2025

Towards Active Synthetic Data Generation for Finetuning Language Models
Samuel Kessler, Menglin Xia, Daniel Madrigal Diaz, Dongge Han, Helia Heshemi, Saravan Rajmohan, Victor Ruehle, Jordan T. Ash
SyDa · 30 Nov 2025

ShoppingComp: Are LLMs Really Ready for Your Shopping Cart?
Huaixiao Tou, Ying Zeng, Cong Ma, Muzhi Li, Minghao Li, Weijie Yuan, He Zhang, Kai Jia
ELM · 28 Nov 2025

AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs
Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku
CoGe VLM · 25 Nov 2025

Why Do Language Model Agents Whistleblow?
Kushal Agrawal, Frank Xiao, Guido Bergman, Asa Cooper Stickland
LLMAG · 21 Nov 2025

Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing
Hossein Mohebbi, Mohammed Abdulrahman, Yanting Miao, Pascal Poupart, Suraj Kothawade
DiffM OffRL · 15 Nov 2025

Evaluating Online Moderation Via LLM-Powered Counterfactual Simulations
Giacomo Fidone, Lucia Passaro, Riccardo Guidotti
10 Nov 2025
Self-Abstraction from Grounded Experience for Plan-Guided Policy Refinement
Hiroaki Hayashi, Bo Pang, Wenting Zhao, Ye Liu, Akash Gokul, Srijan Bansal, Caiming Xiong, Semih Yavuz, Yingbo Zhou
LLMAG LM&Ro LRM · 08 Nov 2025

Silenced Biases: The Dark Side LLMs Learned to Refuse
Rom Himelstein, Amit Levi, Brit Youngmann, Yaniv Nemcovsky, A. Mendelson
05 Nov 2025

Evaluating Control Protocols for Untrusted AI Agents
Jon Kutasov, Chloe Loughridge, Yuqi Sun, Henry Sleight, Buck Shlegeris, Tyler Tracy, Joe Benton
AAML · 04 Nov 2025

Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences
Joshua Ashkinaze, Hua Shen, Sai Avula, Eric Gilbert, Ceren Budak
VLM · 03 Nov 2025

AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence
Md Tanvirul Alam, Dipkamal Bhusal, Salman Ahmad, Nidhi Rastogi, Peter Worth
ELM · 03 Nov 2025

Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning
Max Schaffelder, Albert Gatt
SyDa · 03 Nov 2025

MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models
Zixin Chen, Hongzhan Lin, Kaixin Li, Ziyang Luo, Yayue Deng, Jing Ma
31 Oct 2025
CLINB: A Climate Intelligence Benchmark for Foundational Models
Michelle Chen Huebscher, Katharine Mach, Aleksandar Stanić, Markus Leippold, Ben Gaiarin, ..., Massimiliano Ciaramita, Joeri Rogelj, Christian Buck, Lierni Sestorain Saralegui, Reto Knutti
HILM ELM · 29 Oct 2025

Risk Management for Mitigating Benchmark Failure Modes: BenchRisk
Sean McGregor, Victor Lu, Vassil Tashev, Armstrong Foundjem, Aishwarya Ramasethu, ..., Chris Knotz, Kongtao Chen, Alicia Parrish, Anka Reuel, Heather Frase
24 Oct 2025

Data-Centric Lessons To Improve Speech-Language Pretraining
Vishaal Udandarao, Zhiyun Lu, Xuankai Chang, Yongqiang Wang, Violet Z. Yao, Albin Madapally Jose, Fartash Faghri, Josh Gardner, Chung-Cheng Chiu
22 Oct 2025

Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
Yoshinari Fujinuma
ELM · 21 Oct 2025

Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty
OffRL ALM LRM ELM · 20 Oct 2025

AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs
María Victoria Carro, Denise Alejandra Mester, Facundo Nieto, Oscar Agustín Stanchi, Guido Ernesto Bergman, ..., Luca Nicolás Forziati Gangi, Francisca Gauna Selasco, Juan Gustavo Corvalán, Gerardo Simari, Maria Vanina Martinez
15 Oct 2025
LLM-REVal: Can We Trust LLM Reviewers Yet?
Rui Li, Jia-Chen Gu, Po-Nien Kung, H. Xia, Junfeng Liu, Xiangwen Kong, Zhifang Sui, Nanyun Peng
14 Oct 2025

Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality
Jana Jung, Marlene Lutz, Indira Sen, M. Strohmaier
13 Oct 2025

Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility
Shramay Palta, Peter Rankel, Sarah Wiegreffe, Rachel Rudinger
09 Oct 2025

Everyone prefers human writers, including AI
Wouter Haverals, Meredith Martin
09 Oct 2025

MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces
Reuben Luera, Ryan Rossi, Franck Dernoncourt, Samyadeep Basu, Sungchul Kim, ..., Seunghyun Yoon, Jiuxiang Gu, Zichao Wang, Cindy Xiong Bearfield, Branislav Kveton
09 Oct 2025

ParsTranslit: Truly Versatile Tajik-Farsi Transliteration
Rayyan Merchant, Kevin Tang
08 Oct 2025

Multi-Agent Collaborative Intelligence: Dual-Dial Control for Reliable LLM Reasoning
Edward Y. Chang, Ethan Chang
06 Oct 2025

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, ..., David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, Jerry Tworek
AI4TS ALM · 05 Oct 2025
Know Thyself? On the Incapability and Implications of AI Self-Recognition
Xiaoyan Bai, Aryan Shrivastava, Ari Holtzman, Chenhao Tan
SSL · 03 Oct 2025

ToolTweak: An Attack on Tool Selection in LLM-based Agents
Jonathan Sneh, Ruomei Yan, Jialin Yu, Philip Torr, Y. Gal, Sunando Sengupta, Eric Sommerlade, Alasdair Paren, Adel Bibi
AAML · 02 Oct 2025

Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment
Vanya Bannihatti Kumar, Divyanshu Goyal, Akhil Eppa, Neel Bhandari
ELM LRM · 01 Oct 2025

The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge
Arash Marioriyad, M. Rohban, Mahdieh Soleymani Baghshah
ELM · 30 Sep 2025

Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
Yuansen Liu, Haiming Tang, Jinlong Peng, Jiangning Zhang, Xiaozhong Ji, ..., Chaoyou Fu, Chengjie Wang, Xiaobin Hu, Shuicheng Yan
VLM · 30 Sep 2025

Deconstructing Self-Bias in LLM-generated Translation Benchmarks
Wenda Xu, Sweta Agrawal, Vilém Zouhar, Markus Freitag, Daniel Deutsch
30 Sep 2025

Document Summarization with Conformal Importance Guarantees
Bruce Kuwahara, Chen-Yuan Lin, Xiao Shi Huang, Kin Kwan Leung, Jullian Arta Yapeter, Ilya Stanevich, Felipe Perez, Jesse C. Cresswell
AI4TS · 24 Sep 2025
Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues
Dongxu Lu, Johan Jeuring, Albert Gatt
22 Sep 2025

Variation in Verification: Understanding Verification Dynamics in Large Language Models
Yefan Zhou, Austin Xu, Yilun Zhou, Janvijay Singh, Jiang Gui, Shafiq Joty
LRM · 22 Sep 2025

Bringing Pedagogy into Focus: Evaluating Virtual Teaching Assistants' Question-Answering in Asynchronous Learning Environments
Li Siyan, Zhen Xu, Vethavikashini Chithrra Raghuram, Xuanming Zhang, Renzhe Yu, Zhou Yu
22 Sep 2025

Catch Me If You Can? Not Yet: LLMs Still Struggle to Imitate the Implicit Writing Styles of Everyday Authors
Zhengxiang Wang, Nafis Irtiza Tripto, Solha Park, Zhenzhen Li, Jiawei Zhou
18 Sep 2025

An AI-Powered Framework for Analyzing Collective Idea Evolution in Deliberative Assemblies
Elinor Poole-Dayan, Deb Roy, Jad Kabbara
16 Sep 2025

LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation
Anu Pradhan, Alexandra Ortan, Apurv Verma, Madhavan Seshadri
AILaw ELM · 15 Sep 2025

Preservation of Language Understanding Capabilities in Speech-aware Large Language Models
Marek Kubis, Paweł Skórzewski, Iwona Christop, Mateusz Czyżnikiewicz, Jakub Kubiak, Łukasz Bondaruk, Marcin Lewandowski
AuLLM ELM · 15 Sep 2025

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles
Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee
10 Sep 2025
On the Same Wavelength? Evaluating Pragmatic Reasoning in Language Models across Broad Concepts
Linlu Qiu, Cedegao E. Zhang, J. Tenenbaum, Yoon Kim, R. Levy
ReLM LRM · 08 Sep 2025

X-SQL: Expert Schema Linking and Understanding of Text-to-SQL with Multi-LLMs
Dazhi Peng
07 Sep 2025

ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly
Kimihiro Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Susan Holm, Yuran Wang, Vincent Zhou, Ken Fukuda, Teruko Mitamura
03 Sep 2025

Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators
Dani Roytburg, Matthew Bozoukov, Matthew Nguyen, Jou Barzdukas, Simon Fu, Narmeen Oozeer
LLMSV · 03 Sep 2025

ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics
Li S. Yifei, Allen Chang, Chaitanya Malaviya, Mark Yatskar
30 Aug 2025

Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations
Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush
ALM · 28 Aug 2025

The AI in the Mirror: LLM Self-Recognition in an Iterated Public Goods Game
Olivia Long, Carter Teplica
25 Aug 2025