Is the Cure Still Worse Than the Disease? Test Overfitting by LLMs in Automated Program Repair

Main: 4 pages · Bibliography: 1 page · 5 figures · 4 tables
Abstract
Automated program repair has been shown to be susceptible to generating repaired code that passes the seen tests but fails on a held-out set of hidden tests. This problem, dubbed test overfitting, was identified and studied before the rise of large language models. We experimentally study the extent to which test overfitting remains a problem today, using repository-level SWE-bench tasks.