In recent years, code intelligence has gained increasing importance in the field of automated software engineering. Meanwhile, the widespread adoption of Pretrained Language Models (PLMs) and Large Language Models (LLMs) has raised concerns regarding data contamination and its potential impact on model performance evaluation. This paper presents a systematic empirical study of fine-grained data contamination in code intelligence tasks. Our study involves diverse representative PLMs, namely RoBERTa and GPT-2, and LLMs, namely LLaMA and StarCoder, covering three major tasks: code translation, code generation, and code summarization. We categorize contamination scenarios into four types according to code intelligence practice, namely input-only, output-only, unpaired, and paired contamination settings, and construct corresponding experimental and control groups for each setting.
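To make the four contamination settings concrete, the sketch below shows one way an evaluation pair (task input and reference output) could be leaked into a pretraining corpus under each setting. This is an illustrative assumption, not the paper's actual pipeline; names such as `EvalPair` and `contaminate` are hypothetical.

```python
# Hypothetical sketch of the four contamination settings for a single
# evaluation pair; the real experimental construction may differ.
from dataclasses import dataclass
from typing import List

@dataclass
class EvalPair:
    src: str   # task input, e.g. code to translate or summarize
    tgt: str   # reference output, e.g. translated code or summary

def contaminate(pretrain_corpus: List[str], pair: EvalPair, setting: str) -> List[str]:
    """Return a copy of the pretraining corpus leaked with `pair`
    according to one of the four contamination settings."""
    corpus = list(pretrain_corpus)
    if setting == "input-only":        # only the test input leaks
        corpus.append(pair.src)
    elif setting == "output-only":     # only the reference output leaks
        corpus.append(pair.tgt)
    elif setting == "unpaired":        # both leak, but never co-occur
        corpus.insert(0, pair.tgt)     # placed in separate documents
        corpus.append(pair.src)
    elif setting == "paired":          # input and output leak together
        corpus.append(pair.src + "\n" + pair.tgt)
    else:
        raise ValueError(f"unknown contamination setting: {setting}")
    return corpus
```

A corresponding control group would simply be the untouched `pretrain_corpus`, so that any performance gap between the two groups can be attributed to the injected contamination.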
@article{yang2025_2506.02791,
  title={Rethinking the effects of data contamination in Code Intelligence},
  author={Zhen Yang and Hongyi Lin and Yifan He and Jie Xu and Zeyu Sun and Shuo Liu and Pengpeng Wang and Zhongxing Yu and Qingyuan Liang},
  journal={arXiv preprint arXiv:2506.02791},
  year={2025}
}