
Large language models (LLMs) have demonstrated strong performance on function-level code generation benchmarks, yet real-world software development increasingly demands class-level implementations that integrate multiple methods, attributes, and dependencies within authentic project contexts. This gap between benchmark performance and practical utility raises critical questions about LLMs' readiness for production code assistance, particularly regarding their ability to generalize across familiar and novel codebases.

We introduce a benchmark derived from real-world open-source repositories, comprising classes divided into seen and unseen partitions to evaluate generalization under practical conditions. We systematically examine how input specification completeness and retrieval-augmented generation affect class-level correctness across multiple state-of-the-art LLMs.

Our evaluation reveals a substantial performance gap: while LLMs achieve 84 to 89% correctness on synthetic benchmarks, they attain only 25 to 34% on real-world class tasks, with minimal distinction between familiar and novel codebases. Comprehensive documentation provides marginal improvements (1 to 3%), whereas retrieval augmentation yields greater gains (4 to 7%) by supplying concrete implementation patterns. Error analysis identifies AttributeError, TypeError, and AssertionError as the dominant failure modes, with distinct patterns between synthetic and real-world scenarios.

These findings provide actionable insights for enhancing context modelling, documentation strategies, and retrieval integration in production code assistance tools.
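To make the error analysis concrete, the sketch below shows one way such a failure-mode tally could be computed: execute each generated class against its unit tests and record the exception type of the first failure. This is a minimal illustration assuming a simple exec-based harness; the function and parameter names (`tally_failure_modes`, `candidates`, `test_suites`) are hypothetical and not taken from the paper.

```python
from collections import Counter

def tally_failure_modes(candidates, test_suites):
    """Count the first-failure exception type per generated class.

    Hypothetical harness: `candidates` maps a task id to generated
    Python source; `test_suites` maps the same id to callables that
    exercise the class and raise on failure.
    """
    counts = Counter()
    for task_id, source in candidates.items():
        namespace = {}
        try:
            exec(source, namespace)      # load the generated class definition
            for test in test_suites[task_id]:
                test(namespace)          # raises AssertionError etc. on failure
        except Exception as exc:
            # e.g. AttributeError, TypeError, AssertionError, as in the abstract
            counts[type(exc).__name__] += 1
        else:
            counts["pass"] += 1
    return counts
```

Tallying only the first exception per task keeps the categories mutually exclusive, so the resulting counts can be read directly as a distribution over failure modes.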