Function-level binary code similarity detection is essential in cyberspace security. It helps find bugs and detect patent infringements in released software, and it plays a key role in preventing supply chain attacks. A practical embedding learning framework depends on a robust vector representation of assembly code and on accurate annotation of function pairs. Supervised learning methods are traditionally employed, but annotating function pairs with accurate labels is very difficult. These supervised methods are also prone to overfitting and suffer from vector robustness issues. To mitigate these problems, we propose Fun2Vec, a contrastive learning framework for function-level representation of binary code. We take an unsupervised learning approach and formulate binary code similarity detection as instance discrimination. Fun2Vec works directly on disassembled binary functions and can be implemented with any encoder. It does not require manually labeled similar or dissimilar pairs. Instead, we use compiler optimization options and code obfuscation techniques to generate augmented data. Our experimental results demonstrate that our method surpasses the state of the art in accuracy and has a clear advantage in few-shot settings.
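To make the instance-discrimination formulation concrete, the following is a minimal sketch of a contrastive objective, not the loss used by Fun2Vec, which the abstract does not specify. It assumes an InfoNCE-style loss with cosine similarity, a common choice for instance discrimination. The names `info_nce` and `temperature`, and the toy two-dimensional embeddings, are hypothetical. Each anchor embedding is paired with the embedding of an augmented view of the same function (for example, the same source compiled at a different optimization level), and the other views in the batch act as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed objective, for illustration).

    positives[i] is the embedding of an augmented view of the function
    behind anchors[i]; every other entry of positives is treated as a
    negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability of picking the matching view.
        total += log_sum - logits[i]
    return total / len(anchors)


# Toy embeddings standing in for encoder outputs (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched_views = [[0.9, 0.1], [0.1, 0.9]]
mismatched_views = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = info_nce(anchors, matched_views)
loss_mismatched = info_nce(anchors, mismatched_views)
```

In this sketch the loss is small when each function's embedding lies closest to its own augmented view and large when it lies closer to another function's view, which is the property that removes the need for manually labeled pairs.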