LLM Evaluators Recognize and Favor Their Own Generations

15 April 2024

Arjun Panickssery

Samuel R. Bowman

Shi Feng

Papers citing "LLM Evaluators Recognize and Favor Their Own Generations"

50 / 154 papers shown

Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey

489

20 Mar 2025

OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs

364

14 Mar 2025

Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation

377

11 Mar 2025

Language Models Fail to Introspect About Their Knowledge of Language

408

10 Mar 2025

SwiLTra-Bench: The Swiss Legal Translation Benchmark

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

305

03 Mar 2025

Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning

Hongyi Cai

Jie Li

Mohammad Mahdinur Rahman

Wenzhen Dong

398

26 Feb 2025

Single- vs. Dual-Prompt Dialogue Generation with LLMs for Job Interviews in Human Resources

273

25 Feb 2025

Automatic Input Rewriting Improves Translation with Large Language Models

North American Chapter of the Association for Computational Linguistics (NAACL), 2025

Dayeon Ki

Marine Carpuat

325

23 Feb 2025

CLIPPER: Compression enables long-context synthetic data generation

437

20 Feb 2025

RLTHF: Targeted Human Feedback for LLM Alignment

...

465

19 Feb 2025

Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare

374

18 Feb 2025

AI Alignment at Your Discretion

Conference on Fairness, Accountability and Transparency (FAccT), 2025

311

10 Feb 2025

AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

985

04 Feb 2025

Preference Leakage: A Contamination Problem in LLM-as-a-judge

592

03 Feb 2025

Software Engineering and Foundation Models: Insights from Industry Blogs Using a Jury of Foundation Models

Hao Li

Cor-Paul Bezemer

Ahmed E. Hassan

314

08 Jan 2025

Exploring and Controlling Diversity in LLM-Agent Conversation

500

30 Dec 2024

LLM-based relevance assessment still can't replace human relevance assessment

International Workshop on Evaluating Information Access (EIA), 2024

Charles L. A. Clarke

Laura Dietz

ELM

197

22 Dec 2024

Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Katarina Marcincinova

Matus Mesarcik

414

18 Dec 2024

QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs

International Conference on Computational Linguistics (COLING), 2024

369

16 Dec 2024

JuStRank: Benchmarking LLM Judges for System Ranking

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

469

12 Dec 2024

VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Computer Vision and Pattern Recognition (CVPR), 2024

...

533

26 Nov 2024

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

...

1.1K

287

25 Nov 2024

Benchmarking LLMs' Judgments with No Gold Standard

International Conference on Learning Representations (ICLR), 2024

194

11 Nov 2024

Evaluating Creative Short Story Generation in Humans and Large Language Models

530

04 Nov 2024

ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

Kimihiro Hasegawa

Wiradee Imrattanatrai

249

29 Oct 2024

BQA: Body Language Question Answering Dataset for Video Large Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

423

17 Oct 2024

Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data

International Conference on Learning Representations (ICLR), 2024

405

17 Oct 2024

Unlocking Legal Knowledge: A Multilingual Dataset for Judicial Summarization in Switzerland

Luca Rolshoven

Vishvaksenan Rasiah

Srinanda Brügger Bose

273

17 Oct 2024

MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

Nandan Thakur

Suleman Kazi

Ge Luo

Jimmy J. Lin

Amin Ahmad

VLM RALM

463

17 Oct 2024

LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits

439

02 Oct 2024

CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

326

29 Sep 2024

Direct Judgement Preference Optimization

367

23 Sep 2024

From Lists to Emojis: How Format Bias Affects Model Alignment

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

434

18 Sep 2024

PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

Ilya Gusev

LLMAG

505

10 Sep 2024

IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering

Neural Information Processing Systems (NeurIPS), 2024

Xinya Du

243

24 Aug 2024

Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates

494

23 Aug 2024

AcTracer: Active Testing of Large Language Model via Multi-Stage Sampling

ACM Transactions on Software Engineering and Methodology (TOSEM), 2024

348

07 Aug 2024

Self-Recognition in Language Models

Giuseppe Russo

527

09 Jul 2024

AI-AI Bias: large language models favor communications generated by large language models

209

09 Jul 2024

On scalable oversight with weak LLMs judging strong LLMs

...

Rohin Shah

307

05 Jul 2024

Evaluating the Ability of LLMs to Solve Semantics-Aware Process Mining Tasks

Goran Glavaš

181

02 Jul 2024

Compare without Despair: Reliable Preference Evaluation with Generation Separability

Sayan Ghosh

Tejas Srinivasan

Swabha Swayamdipta

290

02 Jul 2024

Free-text Rationale Generation under Readability Level Control

Yi-Sheng Hsu

Nils Feldhus

Sherzod Hakimov

466

01 Jul 2024

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Wenhao Yu

...

David Lo

Daniel Fried

Xiaoning Du

H. D. Vries

Leandro von Werra

603

371

22 Jun 2024

PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data

275

21 Jun 2024

Chumor 1.0: A Truly Funny and Challenging Chinese Humor Understanding Dataset from Ruo Zhi Ba

166

18 Jun 2024

DCA-Bench: A Benchmark for Dataset Curation Agents

363

11 Jun 2024

CRAG -- Comprehensive RAG Benchmark

Neural Information Processing Systems (NeurIPS), 2024

Xiao Yang

...

327

07 Jun 2024

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Ziniu Hu

314

04 Jun 2024

Inverse Constitutional AI: Compressing Preferences into Principles

Eyke Hüllermeier

287

02 Jun 2024