Automated program repair has been shown to be susceptible to generating patches that pass the seen tests but fail on a held-out set of hidden tests. This problem, dubbed test overfitting, was identified and studied before the rise of large language models. We experimentally study to what extent test overfitting remains a problem today, using repository-level SWE-bench tasks.