
AICD Bench: A Challenging Benchmark for AI-Generated Code Detection

Daniil Orel
Dilshod Azizov
Indraneil Paul
Yuxia Wang
Iryna Gurevych
Preslav Nakov
Main: 9 pages, 12 figures, 13 tables; Bibliography: 5 pages; Appendix: 12 pages
Abstract

Large language models (LLMs) are increasingly capable of generating functional source code, raising concerns about authorship, accountability, and security. While detecting AI-generated code is critical, existing datasets and benchmarks are narrow, typically limited to binary human-machine classification under in-distribution settings. To bridge this gap, we introduce AICD Bench, the most comprehensive benchmark for AI-generated code detection. It spans 2M examples, 77 models across 11 families, and 9 programming languages, including recent reasoning models. Beyond scale, AICD Bench introduces three realistic detection tasks: (i) Robust Binary Classification under distribution shifts in language and domain, (ii) Model Family Attribution, grouping generators by architectural lineage, and (iii) Fine-Grained Human-Machine Classification across human, machine, hybrid, and adversarial code. Extensive evaluation of neural and classical detectors shows that performance remains far below practical usability, particularly under distribution shift and for hybrid or adversarial code. We release AICD Bench as a unified, challenging evaluation suite to drive the next generation of robust approaches for AI-generated code detection. The data and the code are available at this https URL.
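To make the binary human-vs-machine task concrete, the sketch below trains a classical detector of the kind the abstract mentions: character n-gram TF-IDF features with logistic regression. This is a minimal illustration, not the paper's method; the load_split helper, the file paths, and the "code"/"label" field names are hypothetical placeholders to be adapted to AICD Bench's actual data format.

```python
# Minimal sketch of a classical baseline for binary AI-generated-code
# detection. The loader, paths, and field names are hypothetical.
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline


def load_split(path):
    """Hypothetical JSONL loader: assumes each line carries a 'code'
    string and a 'label' (0 = human-written, 1 = machine-generated)."""
    codes, labels = [], []
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            codes.append(row["code"])
            labels.append(row["label"])
    return codes, labels


# Placeholder paths; an out-of-distribution test split probes the
# "robust binary classification" setting described in the abstract.
X_train, y_train = load_split("aicd_bench/train.jsonl")
X_test, y_test = load_split("aicd_bench/test_ood.jsonl")

# Character n-grams are a common language-agnostic signal for code,
# which matters when the benchmark spans 9 programming languages.
detector = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), max_features=100_000),
    LogisticRegression(max_iter=1000),
)
detector.fit(X_train, y_train)
print("macro-F1:", f1_score(y_test, detector.predict(X_test), average="macro"))
```

Under the distribution shifts the benchmark introduces (unseen languages, domains, or generator families), a shallow baseline like this would be expected to degrade sharply, which is precisely the gap the evaluation suite is designed to expose.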
