arXiv: 2406.04244
Benchmark Data Contamination of Large Language Models: A Survey
6 June 2024
Cheng Xu
Shuhao Guan
Derek Greene
Mohand-Tahar Kechadi
ELM
ALM
Papers citing
"Benchmark Data Contamination of Large Language Models: A Survey"
50 / 62 papers shown
Title
On the Limits of Innate Planning in Large Language Models
Charles Schepanowski
Charles Ling
LLMAG
LRM
ELM
361
0
0
26 Nov 2025
MACEval: A Multi-Agent Continual Evaluation Network for Large Models
Z. Chen
Yuze Sun
Yuan Tian
Wenjun Zhang
Guangtao Zhai
ALM
ELM
112
0
0
12 Nov 2025
Towards a Standard, Enterprise-Relevant Agentic AI Benchmark: Lessons from 5.5 billion tokens' worth of agentic AI evaluations
JV Roig
40
0
0
11 Nov 2025
ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation
Qin Liu
Jacob Dineen
Y. Huang
Sheng Zhang
Hoifung Poon
Ben Zhou
Muhao Chen
ELM
108
0
0
09 Oct 2025
Detecting Distillation Data from Reasoning Models
H. Zhang
Hyeong Kyu Choi
Yixuan Li
Hongxin Wei
89
0
0
06 Oct 2025
On The Fragility of Benchmark Contamination Detection in Reasoning Models
Han Wang
Haoyu Li
Brian Ko
Huan Zhang
84
1
0
30 Sep 2025
Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing
Sabri Boughorbel
Fahim Dalvi
Nadir Durrani
Majd Hawasly
60
0
0
23 Sep 2025
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Xiang Deng
Jeff Da
Edwin Pan
Yannis Yiming He
Charles Ide
...
Bing Liu
Chen Bo Calvin Zhang
Noah Jacobson
Bing Liu
Brad Kenstler
101
8
0
21 Sep 2025
Limited Reference, Reliable Generation: A Two-Component Framework for Tabular Data Generation in Low-Data Regimes
Mingxuan Jiang
Yongxin Wang
Ziyue Dai
Yicun Liu
Hongyi Nie
Sen Liu
Hongfeng Chai
DiffM
92
0
0
12 Sep 2025
Evaluation Awareness Scales Predictably in Open-Weights Large Language Models
Maheep Chaudhary
Ian Su
Nikhil Hooda
Nishith Shankar
Julia Tan
Kevin Zhu
Ashwinee Panda
Ryan Lagasse
Sean O Brien
ELM
52
0
0
10 Sep 2025
The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks
Claudio S. Pinhanez
Paulo Cavalin
Cassia Sanctos
Marcelo Grave
Yago Primerano
57
0
0
05 Sep 2025
Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination
Terry Jingchen Zhang
Gopal Dev
Ning Wang
Nicole Ni
Wenyuan Jiang
Mubashara Akhtar
Bernhard Schölkopf
Mrinmaya Sachan
Zhijing Jin
136
0
0
26 Aug 2025
STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transition Samples
Haiquan Hu
Jiazhi Jiang
Shiyou Xu
Ruhan Zeng
Tian Wang
68
0
0
16 Aug 2025
LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Ming-bo Wen
Yujiong Shen
Jingyi Deng
Yuhui Wang
Yue Zhang
...
Zhiheng Xi
Mingxu Chai
Tao Liang
Zhihui Fei
Zhen Wang
ELM
ALM
157
0
0
07 Aug 2025
Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report
Sajana Weerawardhena
Paul Kassianik
Blaine Nelson
Baturay Saglam
Anu Vellore
...
Dhruv Kedia
Kojin Oshiba
Zhouran Yang
Yaron Singer
Amin Karbasi
ALM
ELM
148
4
0
01 Aug 2025
LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
Q. Guo
Wei Xie
Xiaofang Cai
Enze Wang
Shuoyoucheng Ma
Kai Chen
Xiaofeng Wang
Baosheng Wang
ELM
136
0
0
30 Jul 2025
The Impact of Fine-tuning Large Language Models on Automated Program Repair
Roman Macháček
Anastasiia Grishina
Max Hort
Leon Moonen
100
1
0
26 Jul 2025
How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework
Zi Liang
Liantong Yu
Shiyu Zhang
Qingqing Ye
Haibo Hu
ELM
147
1
0
25 Jul 2025
DCR: Quantifying Data Contamination in LLMs Evaluation
Cheng Xu
Nan Yan
Shuhao Guan
Changhong Jin
Yuke Mei
Yibing Guo
Mohand-Tahar Kechadi
125
1
0
15 Jul 2025
Beyond Parameters: Exploring Virtual Logic Depth for Scaling Laws
Ruike Zhu
Hanwen Zhang
Kevin Li
Tianyu Shi
Y. Duan
Chi Wang
Tianyi Zhou
Arindam Banerjee
Zengyi Qin
VLM
LRM
128
0
0
23 Jun 2025
LastingBench: Defend Benchmarks Against Knowledge Leakage
Yixiong Fang
Tianran Sun
Yuling Shi
Min Wang
Xiaodong Gu
KELM
196
4
0
21 Jun 2025
MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Joseph Peper
Wenzhao Qiu
Ali Payani
Lu Wang
114
0
0
17 Jun 2025
OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics
Yaoming Zhu
Junxin Wang
Yiyang Li
Lin Qiu
Zongyu Wang
...
Xuezhi Cao
Yuhuai Wei
Mingshi Wang
Xunliang Cai
Rong Ma
LRM
268
3
0
12 Jun 2025
Multidimensional Analysis of Specific Language Impairment Using Unsupervised Learning Through PCA and Clustering
IEEE International Conference on Healthcare Informatics (ICHI), 2025
Niruthiha Selvanayagam
148
0
0
05 Jun 2025
RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems
Yixiao Zeng
Tianyu Cao
Danqing Wang
Xinran Zhao
Zimeng Qiu
Morteza Ziyadi
Tongshuang Wu
Lei Li
RALM
199
1
0
01 Jun 2025
Can LLMs Reason Structurally? An Evaluation via the Lens of Data Structures
Yu He
Yingxi Li
Colin White
Ellen Vitercik
ELM
LRM
150
1
0
29 May 2025
PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Shuhao Guan
Moule Lin
Cheng Xu
Xinyi Liu
Jinman Zhao
Jiexin Fan
Qi Xu
Derek Greene
286
4
0
26 May 2025
AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models
Miguel Angel Peñaloza Perez
Bruno Lopez Orozco
Jesus Tadeo Cruz Soto
Michelle Bruno Hernandez
Miguel Angel Alvarado Gonzalez
Sandra Malagon
LRM
ELM
108
1
0
25 May 2025
Generative AI and Creativity: A Systematic Literature Review and Meta-Analysis
Niklas Holzner
Sebastian Maier
Stefan Feuerriegel
168
9
0
22 May 2025
Protoknowledge Shapes Behaviour of LLMs in Downstream Tasks: Memorization and Generalization with Knowledge Graphs
Federico Ranaldi
Andrea Zugarini
Leonardo Ranaldi
Fabio Massimo Zanzotto
111
1
0
21 May 2025
An Empirical Study of Many-to-Many Summarization with Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jiaan Wang
Fandong Meng
Zengkui Sun
Yunlong Liang
Yuxuan Cao
Jiarong Xu
Haoxiang Shi
Jie Zhou
176
0
0
19 May 2025
Generative Induction of Dialogue Task Schemas with Streaming Refinement and Simulated Interactions
James D. Finch
Yasasvi Josyula
Jinho Choi
144
0
0
25 Apr 2025
Automatically Generating Rules of Malicious Software Packages via Large Language Model
Dependable Systems and Networks (DSN), 2025
XiangRui Zhang
HaoYu Chen
YongZhong He
Wenjia Niu
Qiang Li
153
1
0
24 Apr 2025
Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yejun Yoon
Jaeyoon Jung
Seunghyun Yoon
Kunwoo Park
281
1
0
19 Apr 2025
From Stability to Inconsistency: A Study of Moral Preferences in LLMs
Monika Jotautaite
Mary Phuong
Chatrik Singh Mangat
Maria Angelica Martinez
98
0
0
08 Apr 2025
WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization
I. Gevers
Victor De Marez
Luna De Bruyne
Walter Daelemans
188
1
0
31 Mar 2025
Unlocking the Potential of Past Research: Using Generative AI to Reconstruct Healthcare Simulation Models
Journal of the Operational Research Society (JORS), 2025
Thomas Monks
Alison Harper
Amy Heather
214
1
0
27 Mar 2025
The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination
Yifan Sun
Han Wang
Dongbai Li
Gang Wang
Huan Zhang
AAML
252
4
0
20 Mar 2025
Framing the Game: How Context Shapes LLM Decision-Making
Isaac Robinson
John Burden
180
1
0
05 Mar 2025
Reasoning about Affordances: Causal and Compositional Reasoning in LLMs
Magnus F. Gjerde
Vanessa Cheung
David Lagnado
ReLM
LRM
211
0
0
23 Feb 2025
Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
Simin Chen
Yiming Chen
Zexin Li
Yifan Jiang
Zhongwei Wan
...
Dezhi Ran
Tianle Gu
Haoyang Li
Tao Xie
Baishakhi Ray
240
18
0
23 Feb 2025
Social Genome: Grounded Social Reasoning Abilities of Multimodal Models
Leena Mathur
Marian Qian
Paul Pu Liang
Louis-Philippe Morency
LRM
1.0K
13
0
21 Feb 2025
Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements
Guangxiang Zhao
Saier Hu
Xiaoqi Jian
Jinzhu Wu
Yuhan Wu
Change Jia
Lin Sun
Xiangzheng Zhang
320
2
0
18 Feb 2025
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation
Maria Eriksson
Erasmo Purificato
Arman Noroozian
Joao Vinagre
Guillaume Chaslot
Emilia Gomez
David Fernandez-Llorca
ELM
658
22
0
10 Feb 2025
Preference Leakage: A Contamination Problem in LLM-as-a-judge
Dawei Li
Renliang Sun
Yue Huang
Ming Zhong
Bohan Jiang
Jiawei Han
Wei Wei
Wei Wang
Huan Liu
482
64
0
03 Feb 2025
Multi-Step Reasoning in Korean and the Emergent Mirage
Guijin Son
Hyunwoo Ko
Dasol Choi
LRM
ReLM
266
2
0
10 Jan 2025
Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle
Hui Dai
Ryan Teehan
Mengye Ren
KELM
AIFin
ELM
212
7
0
13 Nov 2024
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Yujuan Fu
Özlem Uzuner
Meliha Yetisgen
Fei Xia
346
16
0
24 Oct 2024
CAP: Data Contamination Detection via Consistency Amplification
Yi Zhao
Jing Li
Linyi Yang
99
1
0
19 Oct 2024
HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings
Varun Gumma
Anandhita Raghunath
Mohit Jain
Sunayana Sitaram
LM&MA
188
5
0
17 Oct 2024