Automated program repair has been shown to be susceptible to generating patches that pass the seen tests but fail on a held-out set of hidden tests. This problem, dubbed test overfitting, was identified and studied before the rise of large language models. We experimentally study to what extent test overfitting remains a problem today, using repository-level SWE-bench tasks.