Rapid digitalisation spurred by the Covid-19 pandemic has resulted in more cyber crime. Malware-as-a-service is now a booming business for cyber criminals. With the surge in malware activity, it is vital for cyber defenders to understand more about the malware samples they have at hand, as such information can greatly influence their next course of action during a breach. Recently, researchers have shown how malware family classification can be done by first converting malware binaries into grayscale images and then passing them through neural networks for classification. However, most work focuses on studying the impact of different neural network architectures on classification performance. In the last year, researchers have shown that augmenting supervised learning with self-supervised learning can improve performance. Even more recently, Data2Vec was proposed as a modality-agnostic self-supervised framework for training neural networks. In this paper, we present BinImg2Vec, a framework for training malware binary image classifiers that incorporates both self-supervised learning and supervised learning to produce a model that consistently outperforms one trained only via supervised learning. We were able to achieve a 4% improvement in classification performance and a 0.5% reduction in performance variance over multiple runs. We also show how our framework produces embeddings that can be well clustered, facilitating model explainability.
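The binary-to-image conversion mentioned above can be illustrated with a minimal sketch. It assumes the common convention of mapping each byte to one pixel intensity (0 to 255) and wrapping the byte stream into fixed-width rows. The function name `bytes_to_grayscale`, the default width of 256, and the zero-padding of the last row are illustrative assumptions, not details taken from the paper.

```python
def bytes_to_grayscale(data: bytes, width: int = 256) -> list[list[int]]:
    """Arrange raw bytes as rows of pixel intensities (one byte = one pixel).

    The row width and the zero-padding of the final row are assumptions
    made for this sketch, not choices confirmed by the paper.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    # Pad with zero bytes so the length is an exact multiple of the width.
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    # Slice the padded stream into rows; each byte becomes an int in 0..255.
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]
```

In practice the bytes would come from a file, for example `bytes_to_grayscale(open(path, "rb").read())` with a hypothetical `path`, and the resulting rows would be converted to an image tensor before being passed to a classifier.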