Malware is a major threat to computer systems and poses many challenges to cybersecurity. Targeted threats, such as ransomware, cause millions of dollars in losses every year. The constant increase in malware infections has motivated popular antivirus (AV) vendors to develop dedicated detection strategies, including meticulously crafted machine learning (ML) pipelines. However, malware developers continually change their samples' features to bypass detection. This constant evolution of malware samples changes the data distribution (i.e., causes concept drift), which directly affects ML model detection rates, an effect that most of the literature does not consider. In this work, we evaluate the impact of concept drift on malware classifiers for two Android datasets: DREBIN (about 130K apps) and a subset of AndroZoo (about 350K apps). We used these datasets to train an Adaptive Random Forest (ARF) classifier and a Stochastic Gradient Descent (SGD) classifier. We ordered all dataset samples by their VirusTotal submission timestamp and then extracted features from their textual attributes using two algorithms (Word2Vec and TF-IDF). We then conducted experiments comparing the two feature extractors, the two classifiers, and four drift detectors (DDM, EDDM, ADWIN, and KSWIN) to determine the best approach for real environments. Finally, we compare possible approaches to mitigate concept drift and propose a novel data stream pipeline that updates both the classifier and the feature extractor. To do so, we conducted a longitudinal evaluation by (i) classifying malware samples collected over nine years (2009-2018), (ii) reviewing concept drift detection algorithms to attest to the pervasiveness of concept drift, (iii) comparing distinct ML approaches to mitigate the issue, and (iv) proposing an ML data stream pipeline that outperformed approaches from the literature.
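To make the idea of error-based drift detection concrete, the following is a minimal sketch of a DDM-style detector written in plain Python. It is not the pipeline evaluated in this work: the class name `DDM`, the helper `find_drifts`, the default thresholds (2 and 3 standard deviations, 30 warm-up samples), and the synthetic error stream are all assumptions chosen for illustration. The detector consumes a stream of 0/1 prediction errors (1 = misclassified) and signals drift when the running error rate rises well above the lowest level seen so far.

```python
import math


class DDM:
    """Simplified DDM-style drift detector over a stream of 0/1 errors.

    Illustrative sketch only; names and default thresholds are assumptions.
    """

    def __init__(self, min_samples=30, warn_level=2.0, drift_level=3.0):
        self.min_samples = min_samples
        self.warn_level = warn_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0               # samples seen since the last reset
        self.errors = 0          # misclassifications since the last reset
        self.p_min = math.inf    # error rate at the best point so far
        self.s_min = math.inf    # its standard deviation

    def update(self, error):
        """Feed one outcome (1 = misclassified); return 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        # Remember the lowest (error rate + deviation) observed so far.
        if p + s <= self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()
            return "drift"
        if p + s > self.p_min + self.warn_level * self.s_min:
            return "warning"
        return "stable"


def find_drifts(error_stream, detector):
    """Return the stream positions at which the detector signals drift."""
    drifts = []
    for i, error in enumerate(error_stream):
        if detector.update(error) == "drift":
            # In a full pipeline, this is the point where both the feature
            # extractor and the classifier would be refit on recent samples.
            drifts.append(i)
    return drifts


if __name__ == "__main__":
    # Synthetic stream: 10% error rate for 500 samples, then 50%.
    stream = [1 if i % 10 == 0 else 0 for i in range(500)]
    stream += [1 if i % 2 == 0 else 0 for i in range(500)]
    print(find_drifts(stream, DDM()))
```

In a test-then-train (prequential) setting, each sample ordered by timestamp would first be classified, the resulting error fed to the detector, and only then used for training. A drift signal marks the point at which retraining is triggered.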