ResearchTrend.AI
LLM Evaluators Recognize and Favor Their Own Generations
Arjun Panickssery, Samuel R. Bowman, Shi Feng
arXiv:2404.13076 · 15 April 2024

Papers citing "LLM Evaluators Recognize and Favor Their Own Generations"

50 / 154 papers shown
When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers
Jack Lu, Ryan Teehan, Jinran Jin, Mengye Ren
LRM · 02 Dec 2025

Towards Active Synthetic Data Generation for Finetuning Language Models
Samuel Kessler, Menglin Xia, Daniel Madrigal Diaz, Dongge Han, Helia Heshemi, Saravan Rajmohan, Victor Ruehle, Jordan T. Ash
SyDa · 30 Nov 2025

ShoppingComp: Are LLMs Really Ready for Your Shopping Cart?
Huaixiao Tou, Ying Zeng, Cong Ma, Muzhi Li, Minghao Li, Weijie Yuan, He Zhang, Kai Jia
ELM · 28 Nov 2025

AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs
Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku
CoGe VLM · 25 Nov 2025

Why Do Language Model Agents Whistleblow?
Kushal Agrawal, Frank Xiao, Guido Bergman, Asa Cooper Stickland
LLMAG · 21 Nov 2025

Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing
Hossein Mohebbi, Mohammed Abdulrahman, Yanting Miao, Pascal Poupart, Suraj Kothawade
DiffM OffRL · 15 Nov 2025

Evaluating Online Moderation Via LLM-Powered Counterfactual Simulations
Giacomo Fidone, Lucia Passaro, Riccardo Guidotti
10 Nov 2025
Self-Abstraction from Grounded Experience for Plan-Guided Policy Refinement
Hiroaki Hayashi, Bo Pang, Wenting Zhao, Ye Liu, Akash Gokul, Srijan Bansal, Caiming Xiong, Semih Yavuz, Yingbo Zhou
LLMAG LM&Ro LRM · 08 Nov 2025

Silenced Biases: The Dark Side LLMs Learned to Refuse
Rom Himelstein, Amit Levi, Brit Youngmann, Yaniv Nemcovsky, A. Mendelson
05 Nov 2025

Evaluating Control Protocols for Untrusted AI Agents
Jon Kutasov, Chloe Loughridge, Yuqi Sun, Henry Sleight, Buck Shlegeris, Tyler Tracy, Joe Benton
AAML · 04 Nov 2025

Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences
Joshua Ashkinaze, Hua Shen, Sai Avula, Eric Gilbert, Ceren Budak
VLM · 03 Nov 2025

AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence
Md Tanvirul Alam, Dipkamal Bhusal, Salman Ahmad, Nidhi Rastogi, Peter Worth
ELM · 03 Nov 2025

Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning
Max Schaffelder, Albert Gatt
SyDa · 03 Nov 2025

MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models
Zixin Chen, Hongzhan Lin, Kaixin Li, Ziyang Luo, Yayue Deng, Jing Ma
31 Oct 2025
CLINB: A Climate Intelligence Benchmark for Foundational Models
Michelle Chen Huebscher, Katharine Mach, Aleksandar Stanić, Markus Leippold, Ben Gaiarin, ..., Massimiliano Ciaramita, Joeri Rogelj, Christian Buck, Lierni Sestorain Saralegui, Reto Knutti
HILM ELM · 29 Oct 2025

Risk Management for Mitigating Benchmark Failure Modes: BenchRisk
Sean McGregor, Victor Lu, Vassil Tashev, Armstrong Foundjem, Aishwarya Ramasethu, ..., Chris Knotz, Kongtao Chen, Alicia Parrish, Anka Reuel, Heather Frase
24 Oct 2025

Data-Centric Lessons To Improve Speech-Language Pretraining
Vishaal Udandarao, Zhiyun Lu, Xuankai Chang, Yongqiang Wang, Violet Z. Yao, Albin Madapally Jose, Fartash Faghri, Josh Gardner, Chung-Cheng Chiu
22 Oct 2025

Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
Yoshinari Fujinuma
ELM · 21 Oct 2025

Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty
OffRL ALM LRM ELM · 20 Oct 2025

AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs
María Victoria Carro, Denise Alejandra Mester, Facundo Nieto, Oscar Agustín Stanchi, Guido Ernesto Bergman, ..., Luca Nicolás Forziati Gangi, Francisca Gauna Selasco, Juan Gustavo Corvalán, Gerardo Simari, Maria Vanina Martinez
15 Oct 2025
LLM-REVal: Can We Trust LLM Reviewers Yet?
Rui Li, Jia-Chen Gu, Po-Nien Kung, H. Xia, Junfeng Liu, Xiangwen Kong, Zhifang Sui, Nanyun Peng
14 Oct 2025

Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality
Jana Jung, Marlene Lutz, Indira Sen, M. Strohmaier
13 Oct 2025

Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility
Shramay Palta, Peter Rankel, Sarah Wiegreffe, Rachel Rudinger
09 Oct 2025

Everyone prefers human writers, including AI
Wouter Haverals, Meredith Martin
09 Oct 2025

MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces
Reuben Luera, Ryan Rossi, Franck Dernoncourt, Samyadeep Basu, Sungchul Kim, ..., Seunghyun Yoon, Jiuxiang Gu, Zichao Wang, Cindy Xiong Bearfield, Branislav Kveton
09 Oct 2025

ParsTranslit: Truly Versatile Tajik-Farsi Transliteration
Rayyan Merchant, Kevin Tang
08 Oct 2025

Multi-Agent Collaborative Intelligence: Dual-Dial Control for Reliable LLM Reasoning
Edward Y. Chang, Ethan Chang
06 Oct 2025

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, ..., David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, Jerry Tworek
AI4TS ALM · 05 Oct 2025
Know Thyself? On the Incapability and Implications of AI Self-Recognition
Xiaoyan Bai, Aryan Shrivastava, Ari Holtzman, Chenhao Tan
SSL · 03 Oct 2025

ToolTweak: An Attack on Tool Selection in LLM-based Agents
Jonathan Sneh, Ruomei Yan, Jialin Yu, Philip Torr, Y. Gal, Sunando Sengupta, Eric Sommerlade, Alasdair Paren, Adel Bibi
AAML · 02 Oct 2025

Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment
Vanya Bannihatti Kumar, Divyanshu Goyal, Akhil Eppa, Neel Bhandari
ELM LRM · 01 Oct 2025

The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge
Arash Marioriyad, M. Rohban, Mahdieh Soleymani Baghshah
ELM · 30 Sep 2025

Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
Yuansen Liu, Haiming Tang, Jinlong Peng, Jiangning Zhang, Xiaozhong Ji, ..., Chaoyou Fu, Chengjie Wang, Xiaobin Hu, Shuicheng Yan
VLM · 30 Sep 2025

Deconstructing Self-Bias in LLM-generated Translation Benchmarks
Wenda Xu, Sweta Agrawal, Vilém Zouhar, Markus Freitag, Daniel Deutsch
30 Sep 2025

Document Summarization with Conformal Importance Guarantees
Bruce Kuwahara, Chen-Yuan Lin, Xiao Shi Huang, Kin Kwan Leung, Jullian Arta Yapeter, Ilya Stanevich, Felipe Perez, Jesse C. Cresswell
AI4TS · 24 Sep 2025
Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues
Dongxu Lu, Johan Jeuring, Albert Gatt
22 Sep 2025

Variation in Verification: Understanding Verification Dynamics in Large Language Models
Yefan Zhou, Austin Xu, Yilun Zhou, Janvijay Singh, Jiang Gui, Shafiq Joty
LRM · 22 Sep 2025

Bringing Pedagogy into Focus: Evaluating Virtual Teaching Assistants' Question-Answering in Asynchronous Learning Environments
Li Siyan, Zhen Xu, Vethavikashini Chithrra Raghuram, Xuanming Zhang, Renzhe Yu, Zhou Yu
22 Sep 2025

Catch Me If You Can? Not Yet: LLMs Still Struggle to Imitate the Implicit Writing Styles of Everyday Authors
Zhengxiang Wang, Nafis Irtiza Tripto, Solha Park, Zhenzhen Li, Jiawei Zhou
18 Sep 2025

An AI-Powered Framework for Analyzing Collective Idea Evolution in Deliberative Assemblies
Elinor Poole-Dayan, Deb Roy, Jad Kabbara
16 Sep 2025

LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation
Anu Pradhan, Alexandra Ortan, Apurv Verma, Madhavan Seshadri
AILaw ELM · 15 Sep 2025

Preservation of Language Understanding Capabilities in Speech-aware Large Language Models
Marek Kubis, Paweł Skórzewski, Iwona Christop, Mateusz Czyżnikiewicz, Jakub Kubiak, Łukasz Bondaruk, Marcin Lewandowski
AuLLM ELM · 15 Sep 2025

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles
Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee
10 Sep 2025
On the Same Wavelength? Evaluating Pragmatic Reasoning in Language Models across Broad Concepts
Linlu Qiu, Cedegao E. Zhang, J. Tenenbaum, Yoon Kim, R. Levy
ReLM LRM · 08 Sep 2025

X-SQL: Expert Schema Linking and Understanding of Text-to-SQL with Multi-LLMs
Dazhi Peng
07 Sep 2025

ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly
Kimihiro Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Susan Holm, Yuran Wang, Vincent Zhou, Ken Fukuda, Teruko Mitamura
03 Sep 2025

Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators
Dani Roytburg, Matthew Bozoukov, Matthew Nguyen, Jou Barzdukas, Simon Fu, Narmeen Oozeer
LLMSV · 03 Sep 2025

ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics
Li S. Yifei, Allen Chang, Chaitanya Malaviya, Mark Yatskar
30 Aug 2025

Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations
Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush
ALM · 28 Aug 2025

The AI in the Mirror: LLM Self-Recognition in an Iterated Public Goods Game
Olivia Long, Carter Teplica
25 Aug 2025