Asteria: Deep Learning-based AST-Encoding for Cross-platform Binary Code Similarity Detection

Binary code similarity detection is a fundamental technique for many security applications such as vulnerability search, patch analysis, and malware detection. As critical vulnerabilities in IoT devices increase, so does the need to detect similar code across architectures for vulnerability search. The variety of IoT hardware architectures and software platforms requires similarity detection to capture the semantic equivalence of code fragments. However, existing approaches fall short in capturing semantic similarity. We notice that the abstract syntax tree (AST) of a function contains rich semantic information. Inspired by successful applications of natural language processing technologies in sentence semantic understanding, we propose a deep learning-based AST-encoding method, named ASTERIA, to measure the semantic equivalence of functions on different platforms. Our method leverages the Tree-LSTM network to learn the semantic representation of a function from its AST. Similarity detection can then be conducted efficiently and accurately by measuring the similarity between two representation vectors. We have implemented an open-source prototype of ASTERIA. The Tree-LSTM model is trained on a dataset with 1,022,616 function pairs and evaluated on a dataset with 95,078 function pairs. Evaluation results show that our method outperforms the AST-based tool Diaphora and the state-of-the-art method Gemini by large margins in binary similarity detection. Our method is also several orders of magnitude faster than Diaphora and Gemini in the similarity calculation. In the application of vulnerability search, our tool successfully identified 75 vulnerable functions in 5,979 IoT firmware images.
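To make the encoding step concrete, the following is a minimal sketch of the general idea only: a tree-structured LSTM folds an AST bottom-up into one vector, and two functions are compared through their vectors. It is not the ASTERIA implementation. The Child-Sum Tree-LSTM variant, the cosine similarity measure, the `Node` class, the node-type ids, and the random untrained weights are all assumptions chosen for brevity; the abstract does not specify any of them.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical AST node: an integer node-type id plus child nodes."""
    kind: int
    children: list = field(default_factory=list)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def add(*vectors):
    return [sum(xs) for xs in zip(*vectors)]


class ChildSumTreeLSTM:
    """Child-Sum Tree-LSTM cell with random (untrained) weights."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)

        def mat(rows):
            return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(rows)]

        self.dim = dim
        self.embed = mat(vocab_size)  # one embedding row per node type
        # One (W, U, b) triple per gate: input, forget, output, update.
        self.W = {g: mat(dim) for g in "ifou"}
        self.U = {g: mat(dim) for g in "ifou"}
        self.b = {g: [0.0] * dim for g in "ifou"}

    def gate(self, g, x, h, act):
        pre = add(matvec(self.W[g], x), matvec(self.U[g], h), self.b[g])
        return [act(p) for p in pre]

    def encode(self, node):
        """Return (hidden, cell) state for the subtree rooted at node."""
        states = [self.encode(child) for child in node.children]
        x = self.embed[node.kind]
        # Sum of child hidden states; all zeros for a leaf.
        h_sum = add([0.0] * self.dim, *[h for h, _ in states])
        i = self.gate("i", x, h_sum, sigmoid)
        o = self.gate("o", x, h_sum, sigmoid)
        u = self.gate("u", x, h_sum, math.tanh)
        c = [iv * uv for iv, uv in zip(i, u)]
        # Each child gets its own forget gate, computed from its own hidden state.
        for h_k, c_k in states:
            f_k = self.gate("f", x, h_k, sigmoid)
            c = [cv + fv * ck for cv, fv, ck in zip(c, f_k, c_k)]
        h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        return h, c


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical node-type ids: 0=block, 1=if, 2=assign, 3=return.
model = ChildSumTreeLSTM(vocab_size=4, dim=8)
ast_a = Node(0, [Node(1, [Node(2), Node(3)]), Node(3)])
ast_b = Node(0, [Node(1, [Node(2), Node(3)]), Node(3)])  # same structure as ast_a
ast_c = Node(0, [Node(2), Node(2), Node(2)])             # different structure

vec_a, _ = model.encode(ast_a)
vec_b, _ = model.encode(ast_b)
vec_c, _ = model.encode(ast_c)
sim_same = cosine(vec_a, vec_b)
sim_diff = cosine(vec_a, vec_c)
print(sim_same, sim_diff)
```

In this sketch each function's vector is computed once, and every later comparison is a single vector operation. A trained model would learn the weights from labeled function pairs rather than draw them at random, so the scores printed here carry no semantic meaning beyond identical trees mapping to identical vectors.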
