Fuzzing is a key method to discover vulnerabilities in programs. Despite considerable progress in this area in the past years, measuring and comparing the effectiveness of fuzzers is still an open research question. In software testing, the gold standard for evaluating test quality is mutation analysis, which assesses the ability of a test suite to detect synthetic bugs (mutations); if a set of tests fails to detect such mutations, it will also fail to detect real bugs. Mutation analysis subsumes various coverage measures and provides a large and diverse set of faults that can be arbitrarily hard to trigger and detect, thus preventing the problems of saturation and overfitting. Unfortunately, the cost of traditional mutation analysis is exorbitant for fuzzing, as each mutation needs to be evaluated independently. In this paper, we apply modern mutation analysis techniques that pool multiple mutations, allowing us, for the first time, to evaluate and compare fuzzers with mutation analysis. We introduce an evaluation bench for fuzzers and apply it to a number of popular fuzzers and subjects. In a comprehensive evaluation, we show how it allows us to assess fuzzer performance and measure the impact of improved techniques. While we find that today's fuzzers can detect only a small percentage of mutations, this should be seen as a challenge for future research -- notably in improving (1) detection of failures beyond generic crashes and (2) triggering of mutations (and thus faults).
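To make the pooling idea concrete, the following is a minimal sketch (not the paper's actual instrumentation): instead of compiling one binary per mutation, several mutations are compiled into a single binary and selected at run time. The function names, the `MUTATION_ID` environment variable, and the specific mutations shown are hypothetical and serve only to illustrate the concept.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical selector: the active mutation is chosen at run time via an
 * environment variable, so many mutations share one ("pooled") binary. */
static int active_mutation(void) {
    const char *id = getenv("MUTATION_ID");  /* hypothetical variable name */
    return id ? atoi(id) : 0;                /* 0 = original, unmutated code */
}

/* Original behavior: reject source strings that do not fit the buffer. */
int safe_copy(char *dst, size_t dst_len, const char *src) {
    size_t n = strlen(src);
    int ok;
    switch (active_mutation()) {
    case 1:  ok = (n > dst_len - 1); break;  /* mutation 1: operator flipped  */
    case 2:  ok = 1;                 break;  /* mutation 2: check removed     */
    default: ok = (n < dst_len);     break;  /* original bounds check         */
    }
    if (!ok)
        return -1;
    memcpy(dst, src, n + 1);  /* overflows dst when a mutation is active
                                 and the input is too long */
    return 0;
}

int main(void) {
    char buf[8];
    /* A fuzzer "detects" a mutation if some input makes the mutated run
     * crash or misbehave while the original run stays well-behaved. */
    if (safe_copy(buf, sizeof buf, "this input is too long") == 0)
        printf("copied: %s\n", buf);
    else
        printf("rejected\n");
    return 0;
}
```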