Binary code similarity detection (BCSD) has important applications in fields such as vulnerability detection, software component analysis, and reverse engineering. Recent studies have shown that deep neural networks (DNNs) can comprehend the instructions or control-flow graphs (CFGs) of binary code and thereby support BCSD. In this study, we propose a novel Transformer-based approach, jTrans, to learn representations of binary code. It is the first solution that embeds control-flow information of binary code into Transformer-based language models, using a novel jump-aware representation of the analyzed binaries and a newly designed pre-training task. Additionally, we release to the community a newly created large dataset of binaries, BinaryCorp, which is the most diverse to date. Evaluation results show that jTrans outperforms state-of-the-art (SOTA) approaches on this more challenging dataset by 30.5 percentage points (i.e., from 32.0% to 62.5%). In a real-world task of known vulnerability searching, jTrans achieves a recall that is 2X higher than existing SOTA baselines.
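To make the retrieval setting behind the recall figures concrete, the following is a minimal sketch of embedding-based similarity search scored with Recall@k. It is not the jTrans implementation: the three-dimensional vectors are hand-made stand-ins for learned function embeddings, the function names (`memcpy_O0` and so on) are hypothetical, and the use of cosine similarity and Recall@k is an assumption about a typical BCSD evaluation rather than something the abstract specifies.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_pool(query, pool):
    """Return pool function names sorted by descending similarity to the query."""
    return sorted(pool, key=lambda name: cosine(query, pool[name]), reverse=True)


def recall_at_k(queries, pool, k):
    """Fraction of queries whose true match appears among the top-k candidates."""
    hits = sum(1 for emb, truth in queries if truth in rank_pool(emb, pool)[:k])
    return hits / len(queries)


# Hypothetical pool of candidate functions (toy embeddings, not model output).
pool = {
    "memcpy_O0": [0.9, 0.1, 0.0],
    "strlen_O0": [0.1, 0.9, 0.1],
    "md5_O0": [0.0, 0.2, 0.9],
}

# Each query pairs an embedding with the name of its true match in the pool.
queries = [
    ([0.8, 0.2, 0.1], "memcpy_O0"),
    ([0.2, 0.8, 0.0], "strlen_O0"),
    ([0.5, 0.6, 0.1], "memcpy_O0"),  # deliberately ambiguous: ranks strlen_O0 first
]

print(f"Recall@1 = {recall_at_k(queries, pool, 1):.3f}")
print(f"Recall@2 = {recall_at_k(queries, pool, 2):.3f}")
```

In this toy setup two of the three queries retrieve their true match at rank 1, and the ambiguous third query recovers it only at rank 2, so Recall@1 is 2/3 and Recall@2 is 1.0. A larger candidate pool makes the task harder in the same way, since more distractors compete for the top ranks.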