
Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective

Abstract

Code security and usability are both essential for coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus on a single evaluation task and paradigm, such as code completion and generation, and lack comprehensive assessment across dimensions like secure code generation, vulnerability repair, and vulnerability discrimination. In this paper, we first propose CoV-Eval, a multi-task benchmark covering tasks such as code completion, vulnerability repair, vulnerability detection, and vulnerability classification, for a comprehensive evaluation of LLM code security. In addition, we develop VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities more efficiently and reliably. We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs. Overall, while most LLMs identify vulnerable code well, they still tend to generate insecure code and struggle to recognize specific vulnerability types and perform repairs. Extensive experiments and qualitative analyses reveal key challenges and optimization directions, offering insights for future research in LLM code security.

@article{mou2025_2505.10494,
  title={Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective},
  author={Yutao Mou and Xiao Deng and Yuxiao Luo and Shikun Zhang and Wei Ye},
  journal={arXiv preprint arXiv:2505.10494},
  year={2025}
}