Binary code similarity detection (BCSD) is widely used in various binary analysis tasks such as vulnerability search, malware detection, clone detection, and patch analysis. Recent studies have shown that learning-based binary code embedding models perform better than traditional feature-based approaches. In this paper, we propose a novel transformer-based binary code embedding model, named UniASM, to learn representations of binary functions. We design two new training tasks that make the spatial distribution of the generated vectors more uniform, so the embeddings can be used directly for BCSD without any fine-tuning. In addition, we propose a new tokenization approach for binary functions that increases the semantic information carried by each token while mitigating the out-of-vocabulary (OOV) problem. The experimental results show that UniASM outperforms state-of-the-art (SOTA) approaches on the evaluation dataset: the average recall@1 scores are 0.72, 0.63, and 0.77 for the cross-compiler, cross-optimization-level, and cross-obfuscation settings, respectively, all higher than the existing SOTA baselines. In a real-world task of known vulnerability search, UniASM also outperforms all current baselines.
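For readers unfamiliar with the evaluation metric, the following is a minimal illustrative sketch (not the paper's evaluation code) of how recall@1 is commonly computed for embedding-based BCSD: each query function's embedding is compared against a candidate pool by cosine similarity, and a query counts as a hit if its ground-truth match is the single nearest neighbor. The function name and array shapes are assumptions for illustration only.

```python
import numpy as np

def recall_at_1(query_embs: np.ndarray, pool_embs: np.ndarray, gt_idx: np.ndarray) -> float:
    """Fraction of queries whose ground-truth candidate (gt_idx[i]) is the
    nearest neighbor in the pool under cosine similarity."""
    # L2-normalize so the dot product equals cosine similarity
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = q @ p.T                      # shape: (num_queries, pool_size)
    top1 = sims.argmax(axis=1)          # index of the most similar candidate per query
    return float((top1 == gt_idx).mean())
```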