We propose a method that combines machine learning with a static analysis tool (Infer) to automatically repair source code. Machine learning methods perform well at producing idiomatic source code. However, their output is sometimes difficult to trust, as language models can output incorrect code with high confidence. Static analysis tools are trustworthy, but they are also less flexible and produce non-idiomatic code. In this paper, we propose to fix resource leak bugs in intermediate representation (IR) space, and to use a sequence-to-sequence model to propose a fix in source code space. We also study several decoding strategies and use Infer to filter the output of the model. On a dataset of CodeNet submissions with potential resource leak bugs, our method finds a function with the same semantics that does not raise a warning with around 97% precision and 66% recall.
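The filtering step, in which a static analyzer vets the fixes proposed by the model, can be illustrated with a minimal sketch. This is not the authors' implementation: `first_accepted_fix`, `toy_analyzer`, and the candidate strings are hypothetical names introduced for illustration, the candidates are assumed to arrive ordered by model score, and the toy check is a crude stand-in for Infer rather than a call to it. The sketch also covers only the "does not raise a warning" condition and does not check that a candidate preserves the semantics of the original function.

```python
from typing import Callable, Iterable, Optional


def first_accepted_fix(
    candidates: Iterable[str],
    raises_warning: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate the analyzer accepts, or None if all are rejected."""
    for candidate in candidates:
        if not raises_warning(candidate):
            return candidate
    return None


def toy_analyzer(source: str) -> bool:
    """Hypothetical stand-in for a static analyzer (NOT Infer).

    Flags any snippet that opens a resource without closing it.
    """
    return "open(" in source and "close(" not in source


# Hypothetical decoder output, assumed to be ordered from most to least likely.
candidates = [
    "f = open(p); data = f.read()",             # leaks the file handle
    "f = open(p); data = f.read(); f.close()",  # releases the handle
]

fix = first_accepted_fix(candidates, toy_analyzer)
print(fix)
```

Returning `None` when every candidate is rejected lets such a pipeline abstain rather than emit an unverified fix, which is one way a filter of this kind can trade recall for precision.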