Is the Cure Still Worse Than the Disease? Test Overfitting by LLMs in Automated Program Repair

Main: 4 pages
5 figures
Bibliography: 1 page
4 tables
Abstract

Automated program repair has been shown to be susceptible to generating repaired code that passes the seen tests but fails on a held-out set of hidden tests. This problem, dubbed test overfitting, was identified and studied before the rise of large language models. We experimentally study how much of a problem test overfitting remains today, using repository-level SWE-bench tasks.
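
To make the definition of test overfitting concrete, here is a minimal Python sketch. It is an illustration only, not the paper's evaluation setup: the toy absolute-value task, the two candidate patches, and the helper names `passes` and `is_test_overfitting` are all hypothetical. A patch counts as overfitting when it passes every seen test but fails at least one hidden test.

```python
from typing import Callable, List

# Hypothetical types: a patch is a candidate implementation of abs(),
# and a test is a predicate over such a patch.
Patch = Callable[[int], int]
Test = Callable[[Patch], bool]


def overfit_patch(x: int) -> int:
    # Hypothetical "repair" that special-cases the one seen negative input.
    return 3 if x == -3 else x


def general_patch(x: int) -> int:
    # Hypothetical repair that fixes the underlying bug for all inputs.
    return -x if x < 0 else x


# Tests visible during repair versus tests held out for evaluation.
seen_tests: List[Test] = [lambda f: f(-3) == 3, lambda f: f(2) == 2]
hidden_tests: List[Test] = [lambda f: f(-5) == 5, lambda f: f(0) == 0]


def passes(patch: Patch, tests: List[Test]) -> bool:
    # A patch passes a suite only if every test in it succeeds.
    return all(test(patch) for test in tests)


def is_test_overfitting(patch: Patch, seen: List[Test], hidden: List[Test]) -> bool:
    # Overfitting: passes all seen tests but fails at least one hidden test.
    return passes(patch, seen) and not passes(patch, hidden)
```

In this toy setting, `overfit_patch` is flagged as overfitting while `general_patch` is not, which mirrors the seen-versus-hidden distinction the abstract describes.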