We propose a method that combines machine learning with a static analysis tool (Infer) to automatically repair source code. Machine learning methods are good at producing idiomatic source code, but their output can be hard to trust because language models can emit incorrect code with high confidence. Static analysis tools are trustworthy, but they are less flexible and produce non-idiomatic code. In this paper, we propose to fix resource leak bugs in intermediate representation (IR) space and to use a sequence-to-sequence model to propose the fix in source code space. We also study several decoding strategies and use Infer to filter the output of the model. On a dataset of CodeNet submissions with potential resource leak bugs, our method finds a function with the same semantics that does not raise a warning with around 97% precision and 66% recall.
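The filtering step described in the abstract, where decoded candidates are kept only if the analyzer raises no warning, can be illustrated with a minimal sketch. It makes no claim to be the authors' implementation. The names `first_clean_candidate` and `toy_checker` are hypothetical, and `toy_checker` is a crude string-based stand-in for a real analyzer such as Infer. The sketch does not check that a candidate preserves the semantics of the original function.

```python
from typing import Callable, Iterable, Optional


def first_clean_candidate(
    candidates: Iterable[str],
    raises_warning: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate the checker accepts, or None if all are rejected.

    `candidates` stands for model outputs in decoding order (for example,
    beam search hypotheses); `raises_warning` stands for the static analyzer.
    """
    for candidate in candidates:
        if not raises_warning(candidate):
            return candidate
    return None


def toy_checker(source: str) -> bool:
    """Hypothetical stand-in for a static analyzer (NOT Infer itself).

    Flags code that opens a FileReader without try-with-resources
    or an explicit close() call.
    """
    opens_resource = "new FileReader" in source
    releases_resource = "try (" in source or "close()" in source
    return opens_resource and not releases_resource


# Two illustrative Java snippets standing in for decoded candidates.
candidates = [
    "FileReader r = new FileReader(path); r.read();",
    "try (FileReader r = new FileReader(path)) { r.read(); }",
]

selected = first_clean_candidate(candidates, toy_checker)
print(selected)
```

Returning `None` when every candidate is rejected reflects the trade-off reported in the abstract: the filter favors precision, so some inputs receive no proposed fix.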