
Rethinking the effects of data contamination in Code Intelligence

3 June 2025
Zhen Yang
Hongyi Lin
Yifan He
Jie Xu
Zeyu Sun
Shuo Liu
Pengpeng Wang
Zhongxing Yu
Qingyuan Liang
Main: 10 pages · 1 figure · 7 tables · Bibliography: 2 pages
Abstract

In recent years, code intelligence has gained increasing importance in automated software engineering. Meanwhile, the widespread adoption of Pretrained Language Models (PLMs) and Large Language Models (LLMs) has raised concerns about data contamination and its potential impact on model performance evaluation. This paper presents a systematic empirical study of fine-grained data contamination in code intelligence tasks. Our study involves representative PLMs, namely RoBERTa and GPT-2, and LLMs, namely LLaMA and StarCoder, covering three major tasks: code translation, code generation, and code summarization. We categorize contamination scenarios into four types according to code intelligence practice, namely input-only, output-only, unpaired, and paired contamination settings, and construct corresponding experimental and control groups for exploration.
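The four contamination settings in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes an evaluation example is an (input, output) pair, e.g. (source code, summary), and that "contaminating" a pretraining corpus means leaking parts of evaluation examples into it. The function name `contaminate` and the string-concatenation form of paired leakage are illustrative choices.

```python
import random


def contaminate(corpus, eval_pairs, setting, seed=0):
    """Return a copy of `corpus` with evaluation data leaked per `setting`.

    Settings (mirroring the abstract's taxonomy):
      - "input-only":  only task inputs leak into the corpus
      - "output-only": only gold outputs leak
      - "unpaired":    both leak, but with input-output alignment broken
      - "paired":      aligned (input, output) pairs leak together
    """
    rng = random.Random(seed)
    leaked = list(corpus)
    if setting == "input-only":
        leaked += [inp for inp, _ in eval_pairs]
    elif setting == "output-only":
        leaked += [out for _, out in eval_pairs]
    elif setting == "unpaired":
        inputs = [inp for inp, _ in eval_pairs]
        outputs = [out for _, out in eval_pairs]
        rng.shuffle(outputs)  # break the input-output alignment
        leaked += inputs + outputs
    elif setting == "paired":
        leaked += [inp + "\n" + out for inp, out in eval_pairs]
    else:
        raise ValueError(f"unknown setting: {setting}")
    return leaked


# Toy example: one pretraining document, one evaluation pair.
corpus = ["def add(a, b): return a + b"]
evals = [("def mul(a, b): return a * b", "Multiplies two numbers.")]
print(len(contaminate(corpus, evals, "paired")))    # 2 (one paired leak)
print(len(contaminate(corpus, evals, "unpaired")))  # 3 (input and output leak separately)
```

In the paper's design, models trained on such contaminated corpora form the experimental groups, while models trained on the clean `corpus` serve as controls.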

@article{yang2025_2506.02791,
  title={Rethinking the effects of data contamination in Code Intelligence},
  author={Zhen Yang and Hongyi Lin and Yifan He and Jie Xu and Zeyu Sun and Shuo Liu and Pengpeng Wang and Zhongxing Yu and Qingyuan Liang},
  journal={arXiv preprint arXiv:2506.02791},
  year={2025}
}