Machine learning (ML)-based Android malware detection has been one of the most popular research topics in the mobile security community. A growing number of studies have demonstrated that machine learning is an effective and promising approach to malware detection, and some works have even claimed that their proposed models achieve 99% detection accuracy, leaving little room for further improvement. However, numerous prior studies have suggested that unrealistic experimental designs introduce substantial biases, resulting in over-optimistic malware detection performance. Unlike previous research, which examined the detection performance of ML classifiers to locate the causes, this study employs Explainable AI (XAI) approaches to explore what ML-based models learned during training, inspecting and interpreting why ML-based malware classifiers perform so well under unrealistic experimental settings. We discover that temporal sample inconsistency in the training dataset yields over-optimistic classification performance (up to 99% F1 score and accuracy). Importantly, our results indicate that ML models classify samples based on temporal differences between malware and benign applications, rather than on actual malicious behaviors. Our evaluation also confirms that unrealistic experimental designs lead not only to unrealistic detection performance but also to poor reliability, posing a significant obstacle to real-world applications. These findings suggest that XAI approaches should be used to help practitioners and researchers better understand how AI/ML models (e.g., malware detectors) work, rather than focusing solely on improving accuracy.
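To make "temporal sample inconsistency" concrete, the following minimal Python sketch measures how much the release-year distributions of malware and benign samples overlap, and shows a simple time-aware train/test split. It is only an illustration under stated assumptions: the sample layout (dicts with `year` and `label` keys), the function names, and the overlap measure are all hypothetical and are not taken from the study, whose own analysis relies on XAI techniques.

```python
from collections import Counter


def year_distribution(samples, label):
    """Return {year: fraction} for the samples carrying the given label."""
    years = [s["year"] for s in samples if s["label"] == label]
    counts = Counter(years)
    return {year: n / len(years) for year, n in counts.items()}


def temporal_overlap(samples):
    """Overlap (0.0 to 1.0) between malware and benign year distributions.

    A value near 0 means the two classes come from different periods, so a
    classifier could separate them by "age" alone (an assumed proxy for the
    temporal inconsistency described in the abstract).
    """
    mal = year_distribution(samples, "malware")
    ben = year_distribution(samples, "benign")
    years = set(mal) | set(ben)
    return sum(min(mal.get(y, 0.0), ben.get(y, 0.0)) for y in years)


def time_aware_split(samples, cutoff_year):
    """Train only on samples older than the cutoff; test on the rest."""
    train = [s for s in samples if s["year"] < cutoff_year]
    test = [s for s in samples if s["year"] >= cutoff_year]
    return train, test


# Hypothetical toy datasets, for illustration only.
inconsistent = [
    {"year": 2012, "label": "malware"},
    {"year": 2012, "label": "malware"},
    {"year": 2018, "label": "benign"},
    {"year": 2018, "label": "benign"},
]
consistent = [
    {"year": 2012, "label": "malware"},
    {"year": 2018, "label": "malware"},
    {"year": 2012, "label": "benign"},
    {"year": 2018, "label": "benign"},
]

overlap_bad = temporal_overlap(inconsistent)
overlap_good = temporal_overlap(consistent)
train, test = time_aware_split(consistent, cutoff_year=2015)
```

In the first toy dataset every malware sample predates every benign sample, so the overlap is zero and the class label is fully predictable from the year; the second dataset draws both classes from the same years. Checking a statistic of this kind before training is one plausible way to spot the bias, though it does not replace inspecting what the model actually learned.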