LLM Evaluators Recognize and Favor Their Own Generations

15 April 2024

Arjun Panickssery

Samuel R. Bowman

Shi Feng

Papers citing "LLM Evaluators Recognize and Favor Their Own Generations"

50 / 155 papers shown

The AI in the Mirror: LLM Self-Recognition in an Iterated Public Goods Game

Olivia Long

Carter Teplica

162

25 Aug 2025

Hermes 4 Technical Report

128

25 Aug 2025

LLMs that Understand Processes: Instruction-tuning for Semantics-Aware Process Mining

Vira Pyrih

Adrian Rebmann

Han van der Aa

160

22 Aug 2025

STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports

267

13 Aug 2025

MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction

213

09 Aug 2025

Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge

Evangelia Spiliopoulou

159

08 Aug 2025

Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple LLM Judges

Huajun Chen

204

01 Aug 2025

Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities

191

31 Jul 2025

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

142

31 Jul 2025

AraTable: Benchmarking LLMs' Reasoning and Understanding of Arabic Tabular Data

149

24 Jul 2025

Marcel: A Lightweight and Open-Source Conversational Agent for University Student Support

Jan Trienes

Anastasiia Derzhanskaia

129

18 Jul 2025

GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning

215

22 Jun 2025

InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems

231

19 Jun 2025

Correlated Errors in Large Language Models

236

09 Jun 2025

Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning

231

07 Jun 2025

BioMol-MQA: A Multi-Modal Question Answering Dataset For LLM Reasoning Over Bio-Molecular Interactions

210

06 Jun 2025

Explainer-guided Targeted Adversarial Attacks against Binary Code Similarity Detection Models

202

05 Jun 2025

RedDebate: Safer Responses through Multi-Agent Red Teaming Debates

265

04 Jun 2025

Toward Reliable VLM: A Fine-Grained Benchmark and Framework for Exposure, Bias, and Inference in Korean Street Views

Xiaonan Wang

Bo Shao

Hansaem Kim

202

03 Jun 2025

Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

Christopher G. Brinton

Robert Sim

196

29 May 2025

LASER: Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy

Paramita Mirza

Lucas Weber

Fabian Küch

287

28 May 2025

Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator

223

27 May 2025

Towards Conversational Development Environments: Using Theory-of-Mind and Multi-Agent Architectures for Requirements Refinement

283

27 May 2025

Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge

415

25 May 2025

Large Language Models for Predictive Analysis: How Far Are They?

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

272

22 May 2025

Walk&Retrieve: Simple Yet Effective Zero-shot Retrieval-Augmented Generation via Knowledge Graph Walks

318

22 May 2025

InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation

...

356

21 May 2025

LEXam: Benchmarking Legal Reasoning on 340 Law Exams

...

547

19 May 2025

R3: Robust Rubric-Agnostic Reward Models

David Anugraha

Zilu Tang

Lester James V. Miranda

Hanyang Zhao

Mohammad Rifqi Farhansyah

Garry Kuwanto

Derry Wijaya

Genta Indra Winata

634

19 May 2025

SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning

Tianjian Li

Daniel Khashabi

333

05 May 2025

LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning

...

190

04 May 2025

TRUST: An LLM-Based Dialogue System for Trauma Understanding and Structured Assessments

249

30 Apr 2025

RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents

1.0K

21 Apr 2025

LoRe: Personalizing LLMs via Low-Rank Reward Modeling

293

20 Apr 2025

Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy

Regan Bolton

Mohammadreza Sheikhfathollahi

Simon Parkinson

Dan Basher

Howard Parkinson

239

18 Apr 2025

LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA

349

16 Apr 2025

Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions

381

15 Apr 2025

NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Vladislav Mikhailov

Tita Ranveig Enstad

David Samuel

Hans Christian Farsethås

400

10 Apr 2025

AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation

Tuhin Chakrabarty

Philippe Laban

Chien-Sheng Wu

472

10 Apr 2025

Societal Impacts Research Requires Benchmarks for Creative Composition Tasks

Judy Hanwen Shen

Carlos Guestrin

614

09 Apr 2025

Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles

369

09 Apr 2025

Self-Adaptive Cognitive Debiasing for Large Language Models in Decision-Making

747

05 Apr 2025

Verification of Autonomous Neural Car Control with KeYmaera X

International Conference on Abstract State Machines, Alloy, B, TLA, VDM, and Z (ABZ), 2025

Enguerrand Prebet

Samuel Teuber

André Platzer

264

04 Apr 2025

Do LLM Evaluators Prefer Themselves for a Reason?

359

04 Apr 2025

An Illusion of Progress? Assessing the Current State of Web Agents

878

02 Apr 2025

Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

392

31 Mar 2025

Citegeist: Automated Generation of Related Work Analysis on the arXiv Corpus

Claas Beger

Carl-Leander Henneking

242

29 Mar 2025

Evaluating book summaries from internal knowledge in Large Language Models: a cross-model and semantic consistency approach

Javier Coronado-Blázquez

283

27 Mar 2025

3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark

349

26 Mar 2025

Enhancing Product Search Interfaces with Sketch-Guided Diffusion and Language Agents

The Web Conference (WWW), 2025

Edward Sun

255

21 Mar 2025