Fuzzing Deep-Learning Libraries via Large Language Models

Detecting bugs in Deep Learning (DL) libraries is critical: almost all downstream DL systems depend on them for effectiveness and for the safety of end users. Researchers have therefore started developing fuzzing and testing techniques that target DL libraries. Previous work falls mainly into two classes, API-level fuzzing and model-level fuzzing. Neither class can detect bugs that are exposed only by complex API sequences: API-level fuzzers cannot cover API sequences at all, while model-level fuzzers cover only specific API sequence patterns and a small subset of APIs, because tensor computations impose complicated input/shape constraints. To address these limitations, we propose LLMFuzz, the first automated approach to directly leverage Large Pre-trained Language Models (LLMs) to generate input programs for fuzzing DL libraries. LLMs are trained on billions of code snippets and can autoregressively generate human-like code. Our key insight is that the training corpora of modern LLMs also include numerous code snippets that invoke DL library APIs, so the models can implicitly learn the intricate DL API constraints and directly generate or mutate valid DL programs for fuzzing DL libraries. More specifically, we first use a generative LLM (e.g., Codex) to generate high-quality seed programs from input prompts. We then run an evolutionary fuzzing loop that applies an infilling LLM (e.g., InCoder) to make small mutations to the seed programs, producing more diverse API sequences for fuzzing DL libraries. Our experimental results on popular DL libraries show that LLMFuzz covers 91.11% / 24.09% more APIs and achieves 30.38% / 50.84% higher code coverage than state-of-the-art fuzzers on TensorFlow / PyTorch. Furthermore, LLMFuzz detects 65 bugs, 41 of which have already been confirmed as previously unknown.
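The two-stage pipeline described in the abstract (seed generation, then an evolutionary loop of infilling mutations) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: `generate_seeds` and `infill` are hypothetical stubs that stand in for the generative and infilling LLMs and call no model, the standard `math` module stands in for a DL library, and the selection rule (keep a mutant only if it runs and calls an API not yet covered) is an assumption, since the abstract does not state how candidates are ranked.

```python
import ast
import random

MASK = "<SPAN>"  # placeholder marking the span the infilling model should fill


def generate_seeds(prompt):
    """Hypothetical stand-in for a generative LLM; returns canned seed programs."""
    return ["import math\nx = math.sqrt(16.0)\ny = math.floor(x)\nz = math.fabs(y)\n"]


def infill(masked_program, rng):
    """Hypothetical stand-in for an infilling LLM; fills the mask with a canned call."""
    candidates = ["math.sqrt(x)", "math.exp(x)", "math.cos(x)", "math.log(x + 1.0)"]
    return masked_program.replace(MASK, rng.choice(candidates), 1)


def mutate(program, rng):
    """Mask the right-hand side of one random assignment, then infill it."""
    lines = program.splitlines()
    assignments = [i for i, line in enumerate(lines) if " = " in line]
    i = rng.choice(assignments)
    target = lines[i].split(" = ", 1)[0]
    lines[i] = f"{target} = {MASK}"
    return infill("\n".join(lines) + "\n", rng)


def api_calls(program):
    """Collect the names of attribute-style calls (e.g. math.sqrt -> 'sqrt')."""
    calls = set()
    for node in ast.walk(ast.parse(program)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            calls.add(node.func.attr)
    return calls


def runs_cleanly(program):
    """A program is valid if it executes without raising an exception."""
    try:
        exec(program, {})
        return True
    except Exception:
        return False


def fuzz(prompt, iterations=50, seed=0):
    """Evolutionary loop: mutate pool members, keep mutants that add API coverage."""
    rng = random.Random(seed)
    pool = [p for p in generate_seeds(prompt) if runs_cleanly(p)]
    covered = set().union(*(api_calls(p) for p in pool))
    for _ in range(iterations):
        child = mutate(rng.choice(pool), rng)
        if not runs_cleanly(child):
            continue  # invalid mutants are discarded (e.g. variable used before assignment)
        new_apis = api_calls(child) - covered
        if new_apis:  # assumed fitness signal: the mutant reaches an uncovered API
            pool.append(child)
            covered |= new_apis
    return pool, covered


if __name__ == "__main__":
    pool, covered = fuzz("Write a program that uses the math module")
    print(len(pool), sorted(covered))
```

In a real setting the two stubs would be replaced by calls to actual models, and a crash or inconsistent result observed while executing a program would be recorded as a bug candidate rather than silently discarded.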
