A recent trend in binary code analysis promotes the use of neural solutions based on instruction embedding models. An instruction embedding model is a neural network that transforms sequences of assembly instructions into embedding vectors. If the embedding network is trained so that the translation from code to vectors partially preserves the semantics, the network effectively represents an assembly code model. In this paper we present BinBert, a novel assembly code model. BinBert is built on a transformer pre-trained on a huge dataset of both assembly instruction sequences and symbolic execution information. BinBert can be applied to assembly instruction sequences and it is fine-tunable, i.e., it can be re-trained as part of a neural architecture on task-specific data. Through fine-tuning, BinBert learns how to apply the general knowledge acquired during pre-training to the specific task. We evaluated BinBert on a multi-task benchmark that we specifically designed to test the understanding of assembly code. The benchmark comprises several tasks, some taken from the literature and a few novel ones that we designed, with a mix of intrinsic and downstream tasks. Our results show that BinBert outperforms state-of-the-art models for binary instruction embedding, raising the bar for binary code understanding.