Deep learning has recently achieved initial success in program analysis tasks such as bug detection. Lacking real bugs, most existing works construct training and test data by injecting synthetic bugs into correct programs. Despite achieving high test accuracy (e.g., 90%), the resulting bug detectors are found to be surprisingly unusable in practice, i.e., <10% precision when used to scan real software repositories. In this work, we argue that this massive performance difference is caused by a distribution shift, i.e., a fundamental mismatch between the real bug distribution and the synthetic bug distribution used to train and evaluate the detectors. To address this key challenge, we propose to train a bug detector in two phases, first on a synthetic bug distribution to adapt the model to the bug detection domain, and then on a real bug distribution to drive the model towards the real distribution. During these two phases, we leverage a multi-task hierarchy, focal loss, and contrastive learning to further boost performance. We evaluate our approach extensively on three widely studied bug types, for which we construct new datasets carefully designed to capture the real bug distribution. The results demonstrate that our approach is practically effective and successfully mitigates the distribution shift: our learned detectors are highly performant on both our test set and the latest version of open source repositories. Our code, datasets, and models are publicly available at https://github.com/eth-sri/learning-real-bug-detector.
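The abstract names focal loss as one of the training ingredients but does not give its form. As a rough illustration only, the sketch below implements the standard binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), which down-weights easy examples so that rare, hard ones (here, real bugs among mostly correct code) dominate the gradient. The function names, the default values gamma=2.0 and alpha=0.25, and the per-example scalar interface are assumptions made for this example; they are not taken from this work or its released code.

```python
import math


def binary_cross_entropy(p, y):
    """Plain cross-entropy for one prediction (hypothetical helper).

    p: predicted probability of the positive (buggy) class, in (0, 1).
    y: true label, 1 for buggy and 0 for correct.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p   # probability assigned to the true class
    return -math.log(p_t)


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Standard binary focal loss for one prediction (hypothetical helper).

    gamma and alpha are assumed defaults, not values from the paper.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct prediction versus a badly wrong one, both for a buggy sample.
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.1, 1)
```

With `gamma=0` the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy, which is a convenient sanity check when tuning.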