Towards predicting patch correctness in APR, we propose a simple but novel hypothesis on how the link between patch behaviour and failing test specifications can be drawn: similar failing test cases should require similar patches. We then propose BATS, an unsupervised learning-based system that predicts patch correctness by checking patch Behaviour Against failing Test Specification. BATS exploits deep representation learning models for code and patches: for a given failing test case, the resulting embedding is used to compute similarity metrics in a search for similar historical test cases, in order to identify their associated patches, which then serve as a proxy for assessing the correctness of the generated patch. Experimentally, we first validate our hypothesis by assessing whether ground-truth developer patches cluster together in the same way that their associated failing test cases do. Then, after collecting a large dataset of 1278 plausible patches (written by developers or generated by some 32 APR tools), we use BATS to predict correctness: BATS achieves an AUC between 0.557 and 0.718 and a recall between 0.562 and 0.854 in identifying correct patches. Compared against previous work, our approach outperforms state-of-the-art techniques in patch correctness prediction, without requiring the large labeled patch datasets that prior machine learning-based approaches depend on. While BATS is constrained by the availability of similar test cases, we show that it can still complement existing approaches: used in conjunction with a recent supervised learning approach, BATS improves the overall recall in detecting correct patches. We finally show that BATS can also complement PATCH-SIM, the state-of-the-art dynamic approach for identifying correct patches generated by APR tools.
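The retrieval step described above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: the embedding models, the similarity threshold value, and the function and parameter names (`predict_patch_correctness`, `sim_threshold`) are all assumptions introduced here for clarity.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_patch_correctness(failing_test_emb, generated_patch_emb,
                              hist_test_embs, hist_patch_embs,
                              sim_threshold=0.8):
    """Hypothetical sketch of BATS-style retrieval and scoring.

    1. Retrieve historical test cases whose embeddings are similar
       to the failing test's embedding (above sim_threshold).
    2. Score the generated patch by its similarity to the patches
       that fixed those similar tests (used as a correctness proxy).
    Returns None when no similar historical test exists, since the
    approach is constrained by the availability of similar tests.
    """
    sims = [cosine(failing_test_emb, t) for t in hist_test_embs]
    similar_idx = [i for i, s in enumerate(sims) if s >= sim_threshold]
    if not similar_idx:
        return None  # no similar test cases: cannot decide

    return max(cosine(generated_patch_emb, hist_patch_embs[i])
               for i in similar_idx)
```

In this sketch, a generated patch that closely resembles the patches applied for similar historical failures receives a high score and would be predicted correct; the threshold on that score is a free parameter.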