The extensive damage caused by malware requires anti-malware systems to be constantly improved to counter new threats. The current trend in malware detection is to employ machine learning models to aid classification. We propose a new dataset aimed at improving current anti-malware systems, in particular host-based intrusion detection systems, by providing API call sequences for thousands of malware samples executed in Windows 10 virtual machines. A tutorial on how to create and expand this dataset is provided, along with a benchmark demonstrating how to use it to classify malware. The data contains a long sequence of API calls for each sample, so three feature selection methods were tested with the goal of producing models that can be deployed on resource-constrained devices. The principal innovation, however, lies in the multi-label classification system, in which one sequence of API calls can be tagged with multiple labels describing its malicious behaviours.
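To make the multi-label setup concrete, the following is a minimal sketch of how API call sequences might be turned into fixed-length feature vectors and multi-hot label vectors. It is not the dataset's actual pipeline: the API names, the behaviour labels, and the helper functions are hypothetical, and because the abstract does not name the three feature selection methods, a simple document-frequency ranking is used here as a stand-in.

```python
from collections import Counter

# Hypothetical toy data: API names and behaviour labels are illustrative only.
samples = [
    (["CreateFileW", "WriteFile", "RegSetValueExW", "WriteFile"], {"dropper", "persistence"}),
    (["connect", "send", "recv", "CreateFileW"], {"downloader"}),
    (["RegSetValueExW", "connect", "send"], {"persistence", "downloader"}),
]

def top_k_apis(sequences, k):
    """Keep the k APIs that appear in the most samples (stand-in selection rule)."""
    doc_freq = Counter()
    for seq in sequences:
        doc_freq.update(set(seq))  # count each API at most once per sample
    # Sort by descending document frequency, then by name for determinism.
    ranked = sorted(doc_freq, key=lambda api: (-doc_freq[api], api))
    return ranked[:k]

def count_vector(seq, vocab):
    """Map a variable-length API sequence to per-API call counts over vocab."""
    counts = Counter(seq)
    return [counts[api] for api in vocab]

def multi_hot(labels, label_space):
    """Encode a set of behaviour labels as a 0/1 vector; several entries may be 1."""
    return [1 if name in labels else 0 for name in label_space]

sequences = [seq for seq, _ in samples]
vocab = top_k_apis(sequences, k=3)
label_space = sorted(set().union(*(labels for _, labels in samples)))

X = [count_vector(seq, vocab) for seq in sequences]      # reduced feature matrix
Y = [multi_hot(labels, label_space) for _, labels in samples]  # multi-label targets
```

Each row of `Y` can contain more than one positive entry, which is what distinguishes this setup from single-label family classification. Restricting `vocab` to a small number of APIs shrinks the input a model must process, which is the motivation given for feature selection on resource-constrained devices.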