Fuzzing has become a commonly used approach to identifying bugs in complex, real-world programs. However, interpreters are notoriously difficult to fuzz effectively, as they expect highly structured inputs, which are rarely produced by most fuzzing mutations. For this class of programs, grammar-based fuzzing has been shown to be effective. Tools based on this approach can find bugs in the code that is executed after parsing the interpreter inputs, by following language-specific rules when generating and mutating test cases. Unfortunately, grammar-based fuzzing is often unable to discover subtle bugs associated with the parsing and handling of the language syntax. Additionally, if the grammar provided to the fuzzer is incomplete, or does not match the implementation completely, the fuzzer will fail to exercise important parts of the available functionality. In this paper, we propose a new fuzzing technique, called Token-Level Fuzzing. Instead of applying mutations either at the byte level or at the grammar level, Token-Level Fuzzing applies mutations at the token level. Evolutionary fuzzers can leverage this technique to both generate inputs that are parsed successfully and generate inputs that do not conform strictly to the grammar. As a result, the proposed approach can find bugs that neither byte-level fuzzing nor grammar-based fuzzing can find. We evaluated Token-Level Fuzzing by modifying AFL and fuzzing four popular JavaScript engines, finding 29 previously unknown bugs, several of which could not be found with state-of-the-art byte-level and grammar-based fuzzers.
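To make the idea of mutating at the token level concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes a deliberately simplistic regex lexer and a tiny hand-picked vocabulary; the names `VOCAB`, `tokenize`, `mutate_tokens`, and `render` are hypothetical and chosen only for illustration. Each mutation replaces, inserts, or deletes one whole token, so keywords and identifiers stay intact (unlike byte-level mutation) while the result is not forced to conform to a grammar (unlike grammar-based mutation).

```python
import random
import re

# Hypothetical, deliberately tiny token vocabulary (an assumption for this
# sketch); a real fuzzer would draw on a much larger set of tokens.
VOCAB = ["var", "function", "return", "if", "x", "y", "0", "1",
         "(", ")", "{", "}", "[", "]", "=", "+", ";", ","]

# Simplistic lexer: identifiers/keywords, integer literals, or any single
# non-whitespace character. It does not handle strings, comments, or regexes.
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+|\S")


def tokenize(source):
    """Split source text into a flat list of token strings."""
    return TOKEN_RE.findall(source)


def mutate_tokens(tokens, rng):
    """Return a copy of tokens with one random token-level mutation applied."""
    out = list(tokens)
    op = rng.choice(["replace", "insert", "delete"])
    if op == "insert" or not out:
        # Insert a vocabulary token at a random position (also used when
        # the input is empty and nothing can be replaced or deleted).
        out.insert(rng.randint(0, len(out)), rng.choice(VOCAB))
    elif op == "replace":
        # Swap one whole token for another; the result may be ungrammatical.
        out[rng.randrange(len(out))] = rng.choice(VOCAB)
    else:
        # Remove one whole token.
        del out[rng.randrange(len(out))]
    return out


def render(tokens):
    """Join tokens back into text that can be handed to the interpreter."""
    return " ".join(tokens)


if __name__ == "__main__":
    rng = random.Random(0)
    seed = tokenize("var x = [1, 2]; x[0] = x[1] + 1;")
    for _ in range(3):
        print(render(mutate_tokens(seed, rng)))
```

In an evolutionary fuzzer, mutants like these would be executed against the target, and those that reach new coverage would be kept as seeds for further token-level mutation. Some mutants will still parse and others will not, which is the mix the approach aims for.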