Investigating Data Contamination in Modern Benchmarks for Large Language Models
arXiv 2311.09783 · 16 November 2023
Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark B. Gerstein, Arman Cohan
Communities: AAML, ELM

Papers citing "Investigating Data Contamination in Modern Benchmarks for Large Language Models" (15 papers shown)

  1. LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
     Laura Dietz, Oleg Zendel, P. Bailey, Charles L. A. Clarke, Ellese Cotterill, Jeff Dalton, Faegheh Hasibi, Mark Sanderson, Nick Craswell
     ELM · 27 Apr 2025

  2. Repurposing the scientific literature with vision-language models
     Anton Alyakin, Jaden Stryker, Daniel Alber, Karl L. Sangwon, Brandon Duderstadt, ..., Laura Snyder, Eric Leuthardt, Douglas Kondziolka, Eric Karl Oermann
     26 Feb 2025

  3. MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
     Andreas Opedal, Haruki Shirakami, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan
     LRM · 17 Feb 2025

  4. Mind the Data Gap: Bridging LLMs to Enterprise Data Integration
     Moe Kayali, Fabian Wenz, Nesime Tatbul, Çağatay Demiralp
     31 Dec 2024

  5. Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
     D. Song, Sicheng Lai, Shunian Chen, Lichao Sun, Benyou Wang
     06 Nov 2024

  6. Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
     Yujuan Fu, Özlem Uzuner, Meliha Yetisgen, Fei Xia
     24 Oct 2024

  7. How Much Can We Forget about Data Contamination?
     Sebastian Bordt, Suraj Srinivas, Valentyn Boreiko, U. V. Luxburg
     04 Oct 2024

  8. PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
     Ilya Gusev
     LLMAG · 10 Sep 2024

  9. Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools
     Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, Daniel E. Ho
     HILM, ELM, AILaw · 30 May 2024

  10. Black-Box Access is Insufficient for Rigorous AI Audits
      Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell
      AAML · 25 Jan 2024

  11. Don't Make Your LLM an Evaluation Benchmark Cheater
      Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han
      ELM · 03 Nov 2023

  12. Data Contamination Through the Lens of Time
      Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, Samuel Dooley
      16 Oct 2023

  13. Can we trust the evaluation on ChatGPT?
      Rachith Aiyappa, Jisun An, Haewoon Kwak, Yong-Yeol Ahn
      ELM, ALM, LLMAG, AI4MH, LRM · 22 Mar 2023

  14. Training language models to follow instructions with human feedback
      Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
      OSLM, ALM · 04 Mar 2022

  15. The Pile: An 800GB Dataset of Diverse Text for Language Modeling
      Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
      AIMat · 31 Dec 2020