Malware authors often reuse program segments from other applications to perform similar malicious activities, such as stealing information or sending SMS messages. As a result, a malware family or dataset may contain several semantically similar samples. Many researchers are unaware of these semantically similar apps and use their features when evaluating ML models, so the reported performance measures can be seriously affected. In this paper, we study the impact of semantically similar applications on the performance measures of ML-based Android malware detectors. To this end, we propose a novel opcode-subsequence-based clustering algorithm that identifies semantically similar malware and goodware apps. To study this impact, we tested the performance of distinct ML models built on API call and permission features of malware and goodware applications, with and without semantically similar apps. In our experiments with the Drebin dataset, we found that after removing the exact duplicate apps from the dataset (? = 0), the malware detection rate (TPR) of the API-call-based ML models dropped from 0.95 to 0.91, and that of the permission-based model dropped from 0.94 to 0.90. To overcome this issue, we advise the research community to use our clustering algorithm to remove semantically similar apps before evaluating their malware detection mechanisms.
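To make the idea of grouping apps by opcode subsequences concrete, the following is a minimal sketch and not the algorithm proposed in the paper, whose details are not given in this abstract. It assumes each app is already available as a flat list of opcode mnemonics, represents an app by its set of opcode n-grams, and greedily groups apps whose Jaccard distance to a cluster representative is within a threshold. The function names, the choice of n-grams, the Jaccard distance, and the `threshold` parameter are all illustrative assumptions. With `threshold=0.0`, only apps with identical n-gram sets are grouped, which loosely resembles the exact-duplicate setting mentioned above.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

NGram = Tuple[str, ...]


def opcode_ngrams(opcodes: Sequence[str], n: int = 3) -> FrozenSet[NGram]:
    """Return the set of length-n contiguous opcode subsequences of one app."""
    return frozenset(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def jaccard_distance(a: FrozenSet[NGram], b: FrozenSet[NGram]) -> float:
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_apps(
    apps: Dict[str, Sequence[str]], threshold: float = 0.0, n: int = 3
) -> List[List[str]]:
    """Greedy clustering (an assumption, not the paper's method).

    Each app joins the first cluster whose representative (the first
    app placed in it) lies within `threshold`; otherwise it starts a
    new cluster.
    """
    clusters: List[Tuple[FrozenSet[NGram], List[str]]] = []
    for name, opcodes in apps.items():
        grams = opcode_ngrams(opcodes, n)
        for representative, members in clusters:
            if jaccard_distance(representative, grams) <= threshold:
                members.append(name)
                break
        else:
            clusters.append((grams, [name]))
    return [members for _, members in clusters]


def deduplicate(
    apps: Dict[str, Sequence[str]], threshold: float = 0.0, n: int = 3
) -> List[str]:
    """Keep the first-seen app of every cluster and drop the rest."""
    return [members[0] for members in cluster_apps(apps, threshold, n)]


if __name__ == "__main__":
    # Hypothetical toy data: opcode mnemonics are placeholders.
    toy_apps = {
        "app_a": ["const", "invoke", "move", "return"],
        "app_b": ["const", "invoke", "move", "return"],
        "app_c": ["const", "add", "mul", "return"],
    }
    print(cluster_apps(toy_apps))
    print(deduplicate(toy_apps))
```

Under these assumptions, a dataset would be passed through `deduplicate` before the train/test split, so that near-identical samples do not appear on both sides of the split and inflate the detection rate.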