Automated program repair (APR) aims to help developers improve software reliability by generating patches for buggy programs. Although many code language models (CLMs) have been developed and are effective in software tasks such as code completion, there has been little comprehensive, in-depth work to evaluate CLMs' fixing capabilities or to fine-tune CLMs for the APR task. First, this work is the first to evaluate ten CLMs on four APR benchmarks; surprisingly, the best CLM, as is, fixes 72% more bugs than the state-of-the-art deep-learning (DL)-based APR techniques. Second, one of the four APR benchmarks was created by us in this paper to avoid data leakage and enable a fair evaluation. Third, it is the first work to fine-tune CLMs with APR training data, showing that fine-tuning brings a 31%-1,267% improvement to CLMs and enables them to fix 46%-164% more bugs than existing DL-based APR techniques. Fourth, this work studies the impact of buggy lines, showing that CLMs, as is, cannot make good use of the buggy lines to fix bugs, yet fine-tuned CLMs could potentially over-rely on them. Lastly, this work analyzes the size, time, and memory efficiency of different CLMs. This work shows promising directions for the APR domain, such as fine-tuning CLMs with APR-specific designs. It also raises awareness of fair and comprehensive evaluations of CLMs and calls for more transparent reporting of the open-source repositories used in pre-training data to address the data leakage problem.
Title: Impact of Code Language Models on Automated Program Repair
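The figures in the abstract ("72% more bugs", "31%-1,267% improvement") read as relative increases in the number of bugs fixed over a baseline. The minimal sketch below shows that arithmetic under this assumption. The function name `relative_improvement` and all bug counts are hypothetical, chosen only for illustration; they are not results from the paper.

```python
def relative_improvement(baseline_fixed: int, new_fixed: int) -> float:
    """Return the percentage increase in bugs fixed relative to a baseline.

    Assumes "X% more bugs" means (new - baseline) / baseline * 100.
    """
    if baseline_fixed <= 0:
        raise ValueError("baseline_fixed must be positive")
    # Multiply before dividing to keep simple cases exact in floating point.
    return (new_fixed - baseline_fixed) * 100.0 / baseline_fixed


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the paper.
    baseline = 50   # bugs fixed by a baseline technique (made-up number)
    improved = 86   # bugs fixed by another model (made-up number)
    pct = relative_improvement(baseline, improved)
    print(f"{pct:.0f}% more bugs fixed")
```

Because the measure is relative, a very small baseline makes large percentages easy to reach: going from 3 to 41 fixed bugs in this sketch is already an increase of well over 1,000%.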