Given a binary executable without source code, determining what each function does through reverse engineering is difficult, and harder still without prior experience and context. In this paper, we compare how effectively different hashing functions detect similar snippets of lifted LLVM IR code, and present the design and implementation of a framework for a cross-architecture binary code similarity search database that uses MinHash as its hashing algorithm, chosen over SimHash, SSDEEP and TLSH. The motivation is to help reverse engineers quickly gain context about the functions in an unknown binary by comparing them against a database of known functions. The code for this project is open source and can be found at https://github.com/h4sh5/bcddb
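To make the idea of MinHash-based similarity over lifted IR concrete, the following is a minimal sketch and not the implementation used in bcddb. It assumes that an IR snippet is tokenised on whitespace, grouped into 3-token shingles, and summarised by 64 seeded hash functions. The function names (`shingles`, `minhash_signature`, `estimate_similarity`), the shingle size, the number of hash functions, and the sample IR strings are all illustrative assumptions.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-token shingles of a whitespace-tokenised string."""
    tokens = text.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items, num_perm=64):
    """For each seeded hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")  # a distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical IR snippets: the first two differ only in register numbering,
# the third is unrelated.
ir_a = "%3 = add i32 %0, %1\n%4 = mul i32 %3, 2\nret i32 %4"
ir_b = "%5 = add i32 %0, %1\n%6 = mul i32 %5, 2\nret i32 %6"
ir_c = "call void @free(i8* %ptr)\nbr label %exit"

sig_a = minhash_signature(shingles(ir_a))
sig_b = minhash_signature(shingles(ir_b))
sig_c = minhash_signature(shingles(ir_c))

print("a vs b:", estimate_similarity(sig_a, sig_b))
print("a vs c:", estimate_similarity(sig_a, sig_c))
```

Because two signatures can be compared slot by slot without revisiting the original snippets, a database only needs to store one fixed-length signature per known function. A production system would likely also normalise register names before shingling, which this sketch does not do.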