Determining whether multiple instructions can access the same memory location is a critical task in binary analysis. The task is challenging because statically computing precise alias information is undecidable in theory. The problem is aggravated at the binary level by compiler optimizations and the absence of symbols and types. Existing approaches either produce many spurious dependencies due to conservative analysis or scale poorly to complex binaries. We present a new machine-learning-based approach that predicts memory dependencies by exploiting the model's learned knowledge of how binary programs execute. Our approach features (i) a self-supervised procedure that pretrains a neural network to reason over binary code and its dynamic value flows through memory addresses, followed by (ii) supervised finetuning to infer memory dependencies statically. To facilitate efficient learning, we develop dedicated neural architectures that encode the heterogeneous inputs (i.e., code, data values, and memory addresses from traces) with specific modules and fuse them with a composition learning strategy. We implement our approach in NeuDep and evaluate it on 41 popular software projects compiled with 2 compilers, 4 optimizations, and 4 obfuscation passes. We demonstrate that NeuDep is more precise (1.5x) and faster (3.5x) than the current state of the art. Extensive probing studies on security-critical reverse engineering tasks suggest that NeuDep understands memory access patterns, learns function signatures, and is able to match indirect calls. All of these tasks either assist or benefit from inferring memory dependencies. Notably, NeuDep also outperforms the current state of the art on these tasks.
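To make the notion of a memory dependency concrete, the following is a minimal sketch, not NeuDep itself. It derives dependencies dynamically from a single toy execution trace, which is the kind of ground truth a static predictor would be trained to approximate. The trace format, the addresses, and the function name `memory_dependencies` are all hypothetical, and the requirement that at least one access be a write is an assumption taken from the conventional definition of a data dependence rather than from the abstract.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy trace: (instruction address, access kind, memory address).
TRACE = [
    (0x401000, "write", 0x7FFC0010),
    (0x401004, "read", 0x7FFC0018),
    (0x401008, "read", 0x7FFC0010),
    (0x40100C, "write", 0x7FFC0018),
    (0x401010, "read", 0x7FFC0020),
]


def memory_dependencies(trace):
    """Return pairs of instructions observed to access the same address.

    Assumption: a pair counts as dependent only if at least one of the
    two accesses is a write (read-read pairs are ignored).
    """
    # Group the accesses by the memory address they touched.
    accesses = defaultdict(set)
    for insn, kind, addr in trace:
        accesses[addr].add((insn, kind))

    deps = set()
    for touched in accesses.values():
        for (i1, k1), (i2, k2) in combinations(sorted(touched), 2):
            if i1 != i2 and "write" in (k1, k2):
                deps.add((i1, i2))
    return deps


if __name__ == "__main__":
    for a, b in sorted(memory_dependencies(TRACE)):
        print(f"{a:#x} <-> {b:#x}")
```

A trace only reveals dependencies along the paths that were actually executed, which is why the abstract frames the final task as static inference: the prediction must hold without running the binary.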