As in other cybersecurity areas, machine learning (ML) techniques have emerged as a promising solution for detecting Android malware. Many proposals employing a variety of algorithms and feature sets have been presented to date, often reporting impressive detection performance. However, the lack of reproducibility and the absence of a standard evaluation framework make these proposals difficult to compare. In this paper, we analyze 10 influential research works on Android malware detection using a common evaluation framework. We identify five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models and their performance. In particular, we analyze the effect of (1) the presence of duplicated samples, (2) label (goodware/greyware/malware) attribution, (3) class imbalance, (4) the presence of apps that use evasion techniques, and (5) the evolution of apps. Based on this extensive experimentation, we conclude that the studied ML-based detectors have been evaluated optimistically, which explains the good published results. Our findings also highlight that it is imperative to generate realistic experimental scenarios that take these factors into account, in order to foster better ML-based Android malware detection solutions.
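To make three of these factors concrete (duplicated samples, class imbalance, and the evolution of apps), the following is a minimal illustrative sketch and not the evaluation framework used in the paper. It assumes a hypothetical `App` record with a file hash, a label, and a first-seen date, and it treats two samples as duplicates only when their hashes match, which is a simplification. All names and sample values are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class App:
    sha256: str       # hypothetical identifier of the APK file
    label: str        # "goodware" or "malware" (greyware omitted here)
    first_seen: date  # hypothetical date the app was first observed


def deduplicate(apps):
    """Keep only the first occurrence of each hash (assumed duplicate criterion)."""
    seen = set()
    unique = []
    for app in apps:
        if app.sha256 not in seen:
            seen.add(app.sha256)
            unique.append(app)
    return unique


def temporal_split(apps, cutoff):
    """Train on apps seen before the cutoff, test on apps seen on or after it."""
    train = [a for a in apps if a.first_seen < cutoff]
    test = [a for a in apps if a.first_seen >= cutoff]
    return train, test


def malware_ratio(apps):
    """Fraction of samples labeled as malware, to inspect class imbalance."""
    if not apps:
        return 0.0
    return sum(a.label == "malware" for a in apps) / len(apps)


# Invented toy data: the second entry duplicates the first.
apps = [
    App("a1", "goodware", date(2018, 1, 5)),
    App("a1", "goodware", date(2018, 1, 5)),
    App("b2", "malware", date(2018, 6, 1)),
    App("c3", "goodware", date(2019, 2, 10)),
    App("d4", "malware", date(2019, 7, 20)),
]

unique = deduplicate(apps)
train, test = temporal_split(unique, date(2019, 1, 1))
print(len(unique), len(train), len(test), malware_ratio(train))
```

Splitting by date rather than at random keeps later apps out of the training set, which is one way to avoid the optimistic evaluation the abstract describes. Checking the malware ratio of each split shows whether the class balance resembles the scenario the detector is meant for.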