Automated program repair (APR) aims to help developers improve software reliability by generating patches for buggy programs. Although many code language models (CLMs) have been developed and proven effective on software tasks such as code completion, there has been little comprehensive, in-depth work evaluating CLMs' bug-fixing capabilities or fine-tuning CLMs for the APR task. Firstly, this work is the first to evaluate ten CLMs on four APR benchmarks, showing that, surprisingly, the best CLM, as is, fixes 72% more bugs than state-of-the-art deep-learning (DL)-based APR techniques. Secondly, one of the four APR benchmarks was created by us in this paper to avoid data leakage and ensure a fair evaluation. Thirdly, this is the first work to fine-tune CLMs with APR training data, which shows that fine-tuning brings a 31%-1,267% improvement to CLMs and enables them to fix 46%-164% more bugs than existing DL-based APR techniques. Fourthly, this work studies the impact of buggy lines, showing that CLMs, as is, cannot make good use of buggy lines to fix bugs, while fine-tuned CLMs could over-rely on them. Lastly, this work analyzes the size, time, and memory efficiency of different CLMs. This work shows promising directions for the APR domain, such as fine-tuning CLMs with APR-specific designs, and also raises awareness of fair and comprehensive evaluations of CLMs, calling for more transparent reporting of the open-source repositories used in pre-training data to address the data-leakage problem.