In this work, we propose a novel perspective on the problem of patch correctness assessment: a correct patch implements changes that "answer" a question posed by buggy behaviour. Concretely, we cast patch correctness assessment as a Question Answering problem. To tackle this problem, our intuition is that natural language processing can provide the necessary representations and models for assessing the semantic correlation between a bug (question) and a patch (answer). Specifically, we consider as inputs the bug reports as well as the natural language descriptions of the generated patches. Our approach, Quatrain, first applies state-of-the-art commit message generation models to produce the relevant input associated with each generated patch. Then we leverage a neural network architecture to learn the semantic correlation between bug reports and commit messages. Experiments on a large dataset of 9,135 patches generated for three bug datasets (Defects4J, Bugs.jar and Bears) show that Quatrain achieves an AUC of 0.886 on predicting patch correctness, recalling 93% of correct patches while filtering out 62% of incorrect patches. Our experimental results further demonstrate the influence of input quality on prediction performance. We also perform experiments highlighting that the model indeed learns the relationship between bug reports and code change descriptions when making predictions. Finally, we compare against prior work and discuss the benefits of our approach.