Preference Leakage: A Contamination Problem in LLM-as-a-judge

3 February 2025

Papers citing "Preference Leakage: A Contamination Problem in LLM-as-a-judge"

Benchmarking Cognitive Biases in Large Language Models as Evaluators

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

317

124

29 Sep 2023

Time Travel in LLMs: Tracing Data Contamination in Large Language Models

International Conference on Learning Representations (ICLR), 2023

Shahriar Golchin

Mihai Surdeanu

446

144

16 Aug 2023

AgentBench: Evaluating LLMs as Agents

International Conference on Learning Representations (ICLR), 2023

...

527

494

07 Aug 2023

Won't Get Fooled Again: Answering Questions with False Premises

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Zhiyuan Liu

Maosong Sun

224

05 Jul 2023

Assisting Language Learners: Automated Trans-Lingual Definition Generation via Contrastive Prompt Learning

Workshop on Innovative Use of NLP for Building Educational Applications (UNBEA), 2023

294

09 Jun 2023

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Neural Information Processing Systems (NeurIPS), 2023

...

3.2K

6,557

09 Jun 2023

A New Dataset and Empirical Study for Sentence Simplification in Chinese

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Shiping Yang

Renliang Sun

Xiao-Yi Wan

256

07 Jun 2023

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Neural Information Processing Systems (NeurIPS), 2023

Christopher D. Manning

Chelsea Finn

ALM

864

6,697

29 May 2023

OpenAssistant Conversations: Democratizing Large Language Model Alignment

Neural Information Processing Systems (NeurIPS), 2023

...

767

783

14 Apr 2023

Human-like Summarization Evaluation with ChatGPT

Xiaojun Wan

201

169

05 Apr 2023

GPT-4 Technical Report

...

4.6K

20,717

15 Mar 2023

Towards a Unified Multi-Dimensional Evaluator for Text Generation

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

Yang Liu

Heng Ji

250

327

13 Oct 2022

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Annual Meeting of the Association for Computational Linguistics (ACL), 2021

1.6K

2,670

08 Sep 2021

Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

Dirk Groeneveld

309

562

18 Apr 2021

Memorization vs. Generalization: Quantifying Data Leakage in NLP Performance Evaluation

Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021

Aparna Elangovan

Jiayuan He

Karin Verspoor

TDI FedML

344

107

03 Feb 2021

BERTScore: Evaluating Text Generation with BERT

2.4K

7,458

21 Apr 2019

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

384

1,358

25 Mar 2016