Malware incidents continue to increase daily despite the availability of antivirus systems and a range of malware detection and classification methodologies. Many static, dynamic, and hybrid techniques have been proposed to detect malware and classify samples into malware families. Dynamic and hybrid malware classification methods have an advantage over static methods in that they are highly efficient. Since it is more difficult to mask the behavior of malware during execution than to disguise the underlying code that static classification inspects, security experts have focused on machine learning techniques that detect malware and determine its family dynamically. The rapid increase of malware also creates a need for recent, updated datasets of malicious software. We introduce two new, updated datasets in this work: one with 9,795 samples obtained and compiled from VirusSamples and one with 14,616 samples from VirusShare. This paper also analyzes the multi-class malware classification performance of the balanced and imbalanced versions of these two datasets using Histogram-based Gradient Boosting, Random Forest, Support Vector Machine, and XGBoost models with API call-based dynamic malware classification. Results show that Support Vector Machine achieves the highest score, 94%, on the imbalanced VirusSample dataset, whereas the same model reaches 91% accuracy on the balanced VirusSample dataset. XGBoost, one of the most common gradient boosting-based models, achieves the highest scores, 90% and 80%, on the two versions of the VirusShare dataset. This paper also presents baseline results for the VirusShare and VirusSample datasets using the four most widely known machine learning techniques in the dynamic malware classification literature. We believe that these two datasets and baseline results enable researchers in this field to test and validate their methods and approaches.
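To make the API call-based setup and the balanced versus imbalanced comparison more concrete, the following is a minimal sketch, not the authors' pipeline. It assumes each sample is a list of API call names with a family label, turns each trace into a count vector, and produces a balanced subset by random undersampling. The abstract does not say how the balanced versions were built, so undersampling is an assumption, and the traces, family names, and the helper names `bag_of_api_calls` and `undersample` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def bag_of_api_calls(trace, vocabulary):
    """Map one API call trace (a list of names) to a count vector over vocabulary."""
    counts = Counter(trace)
    return [counts[name] for name in vocabulary]


def undersample(samples, seed=0):
    """Randomly downsample every family to the size of the smallest family.

    Assumption: this is one common way to balance a dataset; the source does
    not state which balancing method was actually used.
    """
    by_family = defaultdict(list)
    for trace, family in samples:
        by_family[family].append((trace, family))
    smallest = min(len(group) for group in by_family.values())
    rng = random.Random(seed)
    balanced = []
    for family in sorted(by_family):
        balanced.extend(rng.sample(by_family[family], smallest))
    return balanced


# Toy, made-up traces for illustration only (not taken from either dataset).
samples = [
    (["CreateFileW", "WriteFile", "CloseHandle"], "Trojan"),
    (["RegOpenKeyExW", "RegSetValueExW"], "Trojan"),
    (["CreateFileW", "ReadFile", "CloseHandle"], "Trojan"),
    (["InternetOpenA", "InternetReadFile"], "Worm"),
    (["CreateFileW", "CreateFileW", "WriteFile"], "Virus"),
    (["VirtualAlloc", "WriteFile"], "Virus"),
]

# The vocabulary is every distinct API call name seen in the data.
vocabulary = sorted({name for trace, _ in samples for name in trace})

balanced = undersample(samples)
features = [bag_of_api_calls(trace, vocabulary) for trace, _ in balanced]
labels = [family for _, family in balanced]
```

The resulting `features` and `labels` could then be passed to any of the four classifier families named above. Training on `samples` directly instead of `balanced` would correspond to the imbalanced setting.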