Fuzzing has become one of the most popular techniques to identify bugs in software. To improve the fuzzing process, a plethora of techniques have recently appeared in the academic literature. However, evaluating and comparing these techniques is challenging, as fuzzers depend on randomness when generating test inputs. Commonly, existing evaluations follow best practices for fuzzing evaluations only partially. We argue that the reasons for this are twofold. First, it is unclear whether the proposed guidelines are necessary, due to the lack of comprehensive empirical data for fuzz testing. Second, there does not yet exist a framework that integrates statistical evaluation techniques to enable a fair comparison of fuzzers. To address these limitations, we introduce a novel fuzzing evaluation framework called SENF (Statistical EvaluatioN of Fuzzers). We demonstrate the practical applicability of our framework by using the most widespread fuzzer, AFL, as our baseline and exploring the impact of different evaluation parameters (e.g., the number of repetitions or the run time), compilers, seeds, and fuzzing strategies. Using our evaluation framework, we show that supposedly small changes to these parameters can have a major influence on the measured performance of a fuzzer.
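The abstract does not spell out which statistical techniques the framework integrates. As an illustrative sketch only (not SENF's implementation), the example below shows the kind of comparison commonly recommended for fuzzing evaluations: a Mann-Whitney U test for statistical significance together with the Vargha-Delaney A12 effect size, computed over repeated trials. The trial data and fuzzer names are hypothetical.

```python
"""Hedged sketch: statistically comparing two fuzzers over repeated trials.
This is an assumed illustration of standard practice, not SENF's own code."""
import random

from scipy.stats import mannwhitneyu


def a12(x, y):
    """Vargha-Delaney A12 effect size: probability that a random trial
    from x outperforms a random trial from y (0.5 means no difference)."""
    greater = sum(1 for xi in x for yi in y if xi > yi)
    ties = sum(1 for xi in x for yi in y if xi == yi)
    return (greater + 0.5 * ties) / (len(x) * len(y))


# Hypothetical data: final branch-coverage counts from 30 repetitions
# of a baseline fuzzer and a modified variant on the same target.
random.seed(0)
baseline = [random.gauss(1400, 40) for _ in range(30)]
variant = [random.gauss(1430, 40) for _ in range(30)]

# Non-parametric significance test; no normality assumption is made.
stat, p = mannwhitneyu(variant, baseline, alternative="two-sided")
print(f"p-value = {p:.4f}, A12 = {a12(variant, baseline):.3f}")
```

Because a single fuzzing run is dominated by randomness, such tests are only meaningful over many repetitions, which is why evaluation parameters like the number of repetitions and the run time matter so much.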