Binary authorship analysis is an important problem in many software engineering applications. In this paper, we formulate a binary authorship verification task that reflects the real-world working process of software forensic experts. The task is to determine, from a small set of support samples, whether an anonymous binary was developed by a specific programmer; the actual developer may lie outside the known candidate set and come from the wild. We propose BinMLM, an effective binary authorship verification framework. BinMLM trains an RNN language model on consecutive opcode traces extracted from the control-flow graph (CFG) to characterize the candidate developers' programming styles. We build a mixture-of-shared architecture with multiple shared encoders and author-specific gate layers, which learns how each developer prefers to combine universal programming patterns and alleviates the problem of scarce training data. Through an optimization pipeline of external pre-training, joint training, and fine-tuning, our framework eliminates additional noise and accurately distills developers' unique styles. Extensive experiments show that BinMLM achieves promising results on the Google Code Jam (GCJ) and Codeforces datasets with different numbers of programmers and support samples. It significantly outperforms baselines built on the state-of-the-art feature set (4.73% to 19.46% improvement) and remains robust in multi-author collaboration scenarios. Furthermore, BinMLM can perform organization-level verification on a real-world APT malware dataset, providing valuable auxiliary information for identifying the group behind an APT attack.
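The mixture-of-shared idea described above can be illustrated with a minimal sketch. It is not the BinMLM implementation: the function names, the toy two-dimensional encoder outputs, and the assumption that each author's gate is a plain vector of logits turned into weights by a softmax are all hypothetical choices made for illustration. The sketch only shows how one set of shared encoder outputs can be recombined differently per author.

```python
import math


def softmax(logits):
    """Turn raw gate logits into weights that are positive and sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_shared_encoders(encoder_outputs, gate_logits):
    """Combine K shared-encoder outputs using one author's gate.

    encoder_outputs: list of K vectors (one per shared encoder), equal length.
    gate_logits: list of K numbers, assumed here to be learned per author.
    Returns (weights, mixed_vector).
    """
    if len(encoder_outputs) != len(gate_logits):
        raise ValueError("need exactly one gate logit per shared encoder")
    weights = softmax(gate_logits)
    mixed = [0.0] * len(encoder_outputs[0])
    for weight, vector in zip(weights, encoder_outputs):
        for i, value in enumerate(vector):
            mixed[i] += weight * value
    return weights, mixed


# Toy data (hypothetical): three shared encoders, two-dimensional outputs.
shared_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical per-author gates: each author favours a different encoder.
author_gates = {
    "author_a": [2.0, 0.0, 0.0],
    "author_b": [0.0, 0.0, 2.0],
}

for author, logits in author_gates.items():
    weights, mixed = mix_shared_encoders(shared_outputs, logits)
    print(author, [round(w, 3) for w in weights], [round(m, 3) for m in mixed])
```

In this sketch the encoders are shared by all authors and only the small gate vector is author-specific, which is one way to read the abstract's claim that the architecture helps when each author has few training samples.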