Fuzz testing has been used to find bugs in programs since the 1990s, but despite decades of dedicated research, there is still no consensus on which fuzzing techniques work best. One reason for this is the paucity of ground truth: bugs in real programs with known root causes and triggering inputs are difficult to collect at a meaningful scale. Bug injection technologies that add synthetic bugs into real programs seem to offer a solution, but the differences in finding these synthetic bugs versus organic bugs have not previously been explored at a large scale. Using over 80 years of CPU time, we ran eight fuzzers across 20 targets from the Rode0day bug-finding competition and the LAVA-M corpus. Experiments were standardized with respect to compute resources and metrics gathered. These experiments show differences in fuzzer performance as well as the impact of various configuration options. For instance, it is clear that integrating symbolic execution with mutational fuzzing is very effective and that using dictionaries improves performance. Other conclusions are less clear-cut; for example, no one fuzzer beat all others on all tests. It is noteworthy that no fuzzer found any organic bugs (i.e., bugs reported in CVEs), despite 50 such bugs being available for discovery in the fuzzing corpus. A close analysis of results revealed a possible explanation: a dramatic difference between where synthetic and organic bugs live with respect to the "main path" discovered by fuzzers. We find that recent updates to bug injection systems have made synthetic bugs more difficult to discover, but they are still significantly easier to find than organic bugs in our target programs. Finally, this study identifies flaws in bug injection techniques and suggests a number of axes along which synthetic bugs should be improved.