Decompilation aims to transform a program in a low-level program language (LPL) (e.g., a binary file) into a functionally equivalent program in a high-level program language (HPL) (e.g., C/C++). It is a core technology in software security, especially in vulnerability discovery and malware analysis. In recent years, following the success of neural machine translation (NMT) models in natural language processing (NLP), researchers have tried to build neural decompilers by borrowing ideas from NMT. They formulate decompilation as a translation problem between LPL and HPL, aiming to reduce the human effort required to develop decompilation tools and to improve their generalizability. However, state-of-the-art learning-based decompilers do not cope well with compiler-optimized binaries. Since real-world binaries are mostly compiler-optimized, decompilers that do not consider optimized binaries have limited practical significance. In this paper, we propose a novel learning-based approach named NeurDP that targets compiler-optimized binaries. NeurDP uses a graph neural network (GNN) model to convert LPL to an intermediate representation (IR), which bridges the gap between source code and optimized binary. We also design an Optimized Translation Unit (OTU) to split functions into smaller code fragments for better translation performance. Evaluation results on datasets containing various types of statements show that NeurDP can decompile optimized binaries with 45.21% higher accuracy than state-of-the-art neural decompilation frameworks.