Benchmark Data Contamination of Large Language Models: A Survey

6 June 2024
Cheng Xu
Shuhao Guan
Derek Greene
Mohand-Tahar Kechadi
    ELM
    ALM

Papers citing "Benchmark Data Contamination of Large Language Models: A Survey"

46 / 46 papers shown
Generative Induction of Dialogue Task Schemas with Streaming Refinement and Simulated Interactions
James D. Finch
Yasasvi Josyula
Jinho D. Choi
33
0
0
25 Apr 2025
Automatically Generating Rules of Malicious Software Packages via Large Language Model
XiangRui Zhang
HaoYu Chen
YongZhong He
Wenjia Niu
Qiang Li
25
0
0
24 Apr 2025
Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion
Yejun Yoon
Jaeyoon Jung
Seunghyun Yoon
Kunwoo Park
17
0
0
19 Apr 2025
From Stability to Inconsistency: A Study of Moral Preferences in LLMs
Monika Jotautaite
Mary Phuong
Chatrik Singh Mangat
Maria Angelica Martinez
19
0
0
08 Apr 2025
WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization
I. Gevers
Victor De Marez
Luna De Bruyne
Walter Daelemans
32
0
0
31 Mar 2025
Unlocking the Potential of Past Research: Using Generative AI to Reconstruct Healthcare Simulation Models
Thomas Monks
Alison Harper
Amy Heather
39
0
0
27 Mar 2025
The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination
Yifan Sun
Han Wang
Dongbai Li
Gang Wang
Huan Zhang
AAML
43
0
0
20 Mar 2025
Framing the Game: How Context Shapes LLM Decision-Making
Isaac Robinson
John Burden
38
0
0
05 Mar 2025
Reasoning about Affordances: Causal and Compositional Reasoning in LLMs
Magnus F. Gjerde
Vanessa Cheung
David Lagnado
ReLM
LRM
46
0
0
23 Feb 2025
Social Genome: Grounded Social Reasoning Abilities of Multimodal Models
Leena Mathur
Marian Qian
Paul Pu Liang
Louis-Philippe Morency
LRM
57
1
0
21 Feb 2025
Stress Testing Generalization: How Minor Modifications Undermine Large Language Model Performance
Guangxiang Zhao
Saier Hu
Xiaoqi Jian
Jinzhu Wu
Yuhan Wu
Change Jia
Lin Sun
Xiangzheng Zhang
64
0
0
18 Feb 2025
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation
Maria Eriksson
Erasmo Purificato
Arman Noroozian
Joao Vinagre
Guillaume Chaslot
Emilia Gomez
David Fernandez Llorca
ELM
120
1
0
10 Feb 2025
Multi-Step Reasoning in Korean and the Emergent Mirage
Guijin Son
Hyunwoo Ko
Dasol Choi
LRM
ReLM
59
0
0
10 Jan 2025
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Yujuan Fu
Özlem Uzuner
Meliha Yetisgen
Fei Xia
33
3
0
24 Oct 2024
CAP: Data Contamination Detection via Consistency Amplification
Yi Zhao
Jing Li
Linyi Yang
19
1
0
19 Oct 2024
HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings
Varun Gumma
Anandhita Raghunath
Mohit Jain
Sunayana Sitaram
LM&MA
16
1
0
17 Oct 2024
From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization
Catarina G. Belem
Pouya Pezeskhpour
Hayate Iso
Seiji Maekawa
Nikita Bhutani
Estevam R. Hruschka
HILM
65
1
0
17 Oct 2024
In-Context Learning for Long-Context Sentiment Analysis on Infrastructure Project Opinions
Alireza Shamshiri
Kyeong Rok Ryu
June Young Park
LLMAG
14
1
0
15 Oct 2024
Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks
Rushang Karia
Daniel Bramblett
D. Dobhal
Siddharth Srivastava
ELM
LRM
23
0
0
11 Oct 2024
Fine-tuning can Help Detect Pretraining Data from Large Language Models
H. Zhang
Songxin Zhang
Bingyi Jing
Hongxin Wei
31
0
0
09 Oct 2024
How Much Can We Forget about Data Contamination?
Sebastian Bordt
Suraj Srinivas
Valentyn Boreiko
U. V. Luxburg
36
1
0
04 Oct 2024
ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
Ezra Karger
Houtan Bastani
Chen Yueh-Han
Zachary Jacobs
Danny Halawi
Fred Zhang
P. Tetlock
24
6
0
30 Sep 2024
System 2 thinking in OpenAI's o1-preview model: Near-perfect performance on a mathematics exam
J. D. Winter
Dimitra Dodou
Y. B. Eisma
VLM
ELM
LRM
ReLM
19
9
0
19 Sep 2024
MAVEN-Fact: A Large-scale Event Factuality Detection Dataset
Chunyang Li
Hao Peng
Xiaozhi Wang
Y. Qi
Lei Hou
Bin Xu
Juanzi Li
HILM
20
1
0
22 Jul 2024
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
Kun Qian
Shunji Wan
Claudia Tang
Youzhi Wang
Xuanming Zhang
Maximillian Chen
Zhou Yu
AAML
22
8
0
25 Jun 2024
UNO Arena for Evaluating Sequential Decision-Making Capability of Large Language Models
Zhanyue Qin
Haochuan Wang
Deyuan Liu
Ziyang Song
Cunhang Fan
...
Zhen Lei
Zhiying Tu
Dianhui Chu
Xiaoyan Yu
Dianbo Sui
ELM
LRM
36
1
0
24 Jun 2024
Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation
Chunyuan Deng
Yilun Zhao
Yuzhao Heng
Yitong Li
Jiannan Cao
Xiangru Tang
Arman Cohan
19
1
0
20 Jun 2024
Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?
Omid Ghahroodi
Marzia Nouri
Mohammad V. Sanian
Alireza Sahebi
D. Dastgheib
Ehsaneddin Asgari
M. Baghshah
M. Rohban
ELM
AAML
16
10
0
09 Apr 2024
Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM
Chun Xia
Yinlin Deng
Lingming Zhang
ALM
ELM
21
25
0
28 Mar 2024
Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code
Vahid Majdinasab
Amin Nikanjam
Foutse Khomh
25
8
0
14 Feb 2024
Investigating the Impact of Data Contamination of Large Language Models in Text-to-SQL Translation
Federico Ranaldi
Elena Sofia Ruzzetti
Dario Onorati
Leonardo Ranaldi
Cristina Giannone
Andrea Favalli
Raniero Romagnoli
Fabio Massimo Zanzotto
49
17
0
12 Feb 2024
Task Contamination: Language Models May Not Be Few-Shot Anymore
Changmao Li
Jeffrey Flanigan
71
87
0
26 Dec 2023
Don't Make Your LLM an Evaluation Benchmark Cheater
Kun Zhou
Yutao Zhu
Zhipeng Chen
Wentong Chen
Wayne Xin Zhao
Xu Chen
Yankai Lin
Ji-Rong Wen
Jiawei Han
ELM
99
136
0
03 Nov 2023
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Lianghui Zhu
Xinggang Wang
Xinlong Wang
ELM
ALM
54
103
0
26 Oct 2023
Data Contamination Through the Lens of Time
Manley Roberts
Himanshu Thakur
Christine Herlihy
Colin White
Samuel Dooley
63
30
0
16 Oct 2023
DPIC: Decoupling Prompt and Intrinsic Characteristics for LLM Generated Text Detection
Xiao Yu
Yuang Qi
Kejiang Chen
Guoqiang Chen
Xi Yang
Pengyuan Zhu
Xiuwei Shang
Weiming Zhang
Neng H. Yu
DeLMO
9
11
0
21 May 2023
Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4
Kent K. Chang
Mackenzie Cramer
Sandeep Soni
David Bamman
RALM
138
109
0
28 Apr 2023
Leveraging Large Language Models for Multiple Choice Question Answering
Joshua Robinson
Christopher Rytting
David Wingate
ELM
118
181
0
22 Oct 2022
Heroes, Villains, and Victims, and GPT-3: Automated Extraction of Character Roles Without Training Data
Dominik Stammbach
Maria Antoniak
Elliott Ash
136
32
0
16 May 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
315
8,261
0
28 Jan 2022
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh
Albert Webson
Colin Raffel
Stephen H. Bach
Lintang Sutawika
...
T. Bers
Stella Biderman
Leo Gao
Thomas Wolf
Alexander M. Rush
LRM
203
1,651
0
15 Oct 2021
Memorization vs. Generalization: Quantifying Data Leakage in NLP Performance Evaluation
Aparna Elangovan
Jiayuan He
Karin Verspoor
TDI
FedML
150
89
0
03 Feb 2021
Extracting Training Data from Large Language Models
Nicholas Carlini
Florian Tramèr
Eric Wallace
Matthew Jagielski
Ariel Herbert-Voss
...
Tom B. Brown
D. Song
Ulfar Erlingsson
Alina Oprea
Colin Raffel
MLAU
SILM
264
1,798
0
14 Dec 2020
Pre-trained Models for Natural Language Processing: A Survey
Xipeng Qiu
Tianxiang Sun
Yige Xu
Yunfan Shao
Ning Dai
Xuanjing Huang
LM&MA
VLM
224
1,281
0
18 Mar 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
220
3,054
0
23 Jan 2020