
Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation

Main: 35 Pages
5 Figures
14 Tables
Appendix: 4 Pages
Abstract

Large language models (LLMs) have advanced code generation at the function level, yet their ability to produce correct class-level implementations in authentic software projects remains poorly understood. This work introduces a novel benchmark derived from open-source repositories, comprising real-world classes divided into seen and unseen partitions to evaluate generalization under practical conditions. The evaluation examines multiple LLMs under varied input specifications, retrieval-augmented configurations, and documentation completeness levels.
