Researchers have proposed a variety of malware detection methods to address the explosive growth of mobile security threats. We argue that reported experimental results are inflated by research bias introduced through the variability of malware datasets. We explore the impact of bias in Android malware detection along three dimensions: the method used to flag the ground truth, the distribution of malware families in the dataset, and the way the dataset is used. We run a set of experiments with different VirusTotal (VT) thresholds and find that the method used to flag malware samples directly affects detection performance. We further compare in detail the impact of malware family types and composition on detection, and find that which approach performs best varies with the combination of malware families. Through extensive experiments, we show that the way a dataset is used can have a misleading impact on evaluation, with performance differences exceeding 40%. We argue that the research biases observed in this paper should be carefully controlled or eliminated to enable fair comparison of malware detection techniques. Providing reasonable and explainable results is better than merely reporting high detection accuracy under vague dataset and experimental settings.
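The VT-threshold ground-truth flagging discussed above can be illustrated with a minimal sketch. All names and data here are hypothetical: `vt_detections` maps each sample to the number of VirusTotal engines that flagged it, and `threshold` is the cutoff under study.

```python
def label_samples(vt_detections, threshold):
    """Label a sample as malware (1) if at least `threshold` VT engines flag it, else benign (0)."""
    return {apk: int(count >= threshold) for apk, count in vt_detections.items()}

# Hypothetical scan counts for three APKs.
scans = {"app_a": 0, "app_b": 3, "app_c": 25}

# A strict threshold flags only app_c; a loose one also flags app_b,
# so the same dataset yields different ground truth.
print(label_samples(scans, threshold=4))
print(label_samples(scans, threshold=1))
```

Because the resulting labels change with the threshold, any detector evaluated against them inherits that choice, which is one source of the bias the paper measures.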