Deep learning has recently achieved initial success in program analysis tasks such as bug detection. Lacking real bugs, most existing works construct training and test data by injecting synthetic bugs into correct programs. Despite achieving high test accuracy (e.g., >90%), the resulting bug detectors are found to be surprisingly unusable in practice, i.e., <10% precision when used to scan real software repositories. In this work, we argue that this massive performance difference is caused by distribution shift, i.e., a fundamental mismatch between the real bug distribution and the synthetic bug distribution used to train and evaluate the detectors. To address this key challenge, we propose to train a bug detector in two phases: first on a synthetic bug distribution to adapt the model to the bug detection domain, and then on a real bug distribution to drive the model towards the real distribution. During these two phases, we leverage a multi-task hierarchy, focal loss, and contrastive learning to further boost performance. We evaluate our approach extensively on three widely studied bug types, for which we construct new datasets carefully designed to capture the real bug distribution. The results demonstrate that our approach is practically effective and successfully mitigates the distribution shift: our learned detectors are highly performant on both our constructed test set and the latest versions of open-source repositories.
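To make the two-phase training schedule and the focal loss mentioned above concrete, the following is a minimal sketch in PyTorch, assuming a binary buggy/non-buggy classifier; the names `detector`, `synthetic_loader`, and `real_bug_loader` are hypothetical placeholders, not identifiers from this work, and the multi-task hierarchy and contrastive learning components are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy, well-classified examples so
    training focuses on hard ones (e.g., the rare buggy class)."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()

def train_phase(model, loader, lr=1e-5, epochs=3):
    """One training phase over a given bug distribution (synthetic or real)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for code_batch, labels in loader:          # labels: 1 = buggy, 0 = correct
            optimizer.zero_grad()
            logits = model(code_batch).squeeze(-1) # one logit per program
            loss = focal_loss(logits, labels.float())
            loss.backward()
            optimizer.step()

# Hypothetical usage (placeholder objects, not from the paper):
# Phase 1: adapt the model to the bug detection domain on synthetic bugs.
# train_phase(detector, synthetic_loader)
# Phase 2: drive the model towards the real bug distribution.
# train_phase(detector, real_bug_loader)
```

The key design choice illustrated here is that both phases share the same objective and model; only the data distribution changes, with the synthetic phase serving as domain adaptation and the real-bug phase aligning the detector with the distribution it will face when scanning real repositories.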