In software reverse engineering, decompilation is the process of recovering source code from binary files. Decompilers are used when it is necessary to understand or analyze software for which the source code is not available. Although existing decompilers commonly produce source code with the same behavior as the binaries, that source code is usually hard to interpret and certainly differs from the original code written by the programmer. Massive codebases could be used to build supervised machine learning models aimed at improving existing decompilers. In this article, we build different classification models capable of inferring the high-level return type of functions, with significantly higher accuracy than existing decompilers. We automatically instrument C source code so that binary patterns can be associated with their corresponding high-level constructs. We create a dataset from a collection of real open-source applications plus a large number of synthetic programs. Our system is able to predict function return types with a 79.1% F1-measure, whereas the best decompiler obtains a 30% F1-measure. Moreover, we document the binary patterns used by our classifier so that they can be added to the implementations of existing decompilers.
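To make the reported metric concrete, the following is a minimal sketch of how an F1-measure over predicted return types could be computed. The abstract does not state how F1 is averaged across types, so the macro-average used here is an assumption, and the function names and the six example labels are hypothetical illustrations rather than data from the article.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the true type but was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN); denom > 0 for any label that occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical return-type labels for six functions (illustrative only).
truth = ["int", "int", "char*", "double", "void", "double"]
guess = ["int", "int", "int", "double", "void", "int"]

if __name__ == "__main__":
    print(per_class_f1(truth, guess))
    print(macro_f1(truth, guess))
```

In this toy example, a predictor that falls back to `int` whenever it is unsure is penalized on every type it misses, which is why F1 is more informative than plain accuracy when return types are unevenly distributed.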