Binary code similarity detection (BCSD) is widely used in binary analysis tasks such as vulnerability search, malware detection, clone detection, and patch analysis. Recent studies have shown that learning-based binary code embedding models outperform traditional feature-based approaches. In this paper, we propose UniASM, a novel transformer-based binary code embedding model that learns representations of binary functions. We design two new training tasks that make the spatial distribution of the generated vectors more uniform, so the embeddings can be used directly for BCSD without any fine-tuning. In addition, we present a new tokenization approach for binary functions that enriches each token's semantic information and mitigates the out-of-vocabulary (OOV) problem. We conduct an in-depth analysis of the factors affecting model performance through ablation experiments and report several new and valuable findings. The experimental results show that UniASM outperforms the state-of-the-art (SOTA) approaches on the evaluation dataset: the average Recall@1 scores in the cross-compiler, cross-optimization-level, and cross-obfuscation settings are 0.77, 0.72, and 0.72, respectively. Moreover, in the real-world task of known-vulnerability search, UniASM outperforms all current baselines.
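The Recall@1 metric reported above measures how often the true match for a query function is ranked first among all candidates by embedding similarity. A minimal sketch of this evaluation, assuming cosine similarity over L2-normalized embeddings and a pool where index `i` is the ground-truth match for query `i` (the function name and setup are illustrative, not from the paper):

```python
import numpy as np

def recall_at_1(query_embs: np.ndarray, pool_embs: np.ndarray) -> float:
    """Fraction of queries whose ground-truth match (same index in the
    pool) is the top-1 nearest neighbor by cosine similarity."""
    # L2-normalize so that the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = q @ p.T                        # (num_queries, pool_size)
    top1 = sims.argmax(axis=1)            # most similar pool function per query
    return float((top1 == np.arange(len(q))).mean())
```

In a cross-optimization-level experiment, for example, the queries would be embeddings of functions compiled at `-O0` and the pool their `-O3` counterparts; Recall@1 is then the fraction of queries retrieving the correct counterpart first.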