
Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation

Main: 35 Pages
5 Figures
14 Tables
Appendix: 4 Pages
Abstract

Large language models (LLMs) have advanced code generation at the function level, yet their ability to produce correct class-level implementations in authentic software projects remains poorly understood. This work introduces a novel benchmark derived from open-source repositories, comprising real-world classes divided into seen and unseen partitions to evaluate generalization under practical conditions. The evaluation examines multiple LLMs under varied input specifications, retrieval-augmented configurations, and documentation completeness levels.
