The open set recognition (OSR) problem is a challenge in many machine learning (ML) applications, such as security. Because new and unknown malware families emerge regularly, it is difficult to collect training samples that cover every class an ML system will encounter. An advanced malware classification system should classify the known classes correctly while remaining sensitive to the unknown class. In this paper, we introduce a self-supervised pre-training approach for the OSR problem in malware classification. We propose two transformations for function call graph (FCG) based malware representations to facilitate the pretext task. We also present a statistical thresholding approach to find the optimal threshold for the unknown class. The experimental results indicate that our proposed pre-training process improves performance across different downstream loss functions for the OSR problem.
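The abstract does not spell out the thresholding rule, so the following is only a minimal sketch of one common way to reject unknown samples statistically. It assumes the classifier outputs a per-class probability vector and sets the threshold to the mean of the known-class confidence scores minus k standard deviations. The function names, the parameter `k`, and the label `-1` for the unknown class are hypothetical choices made for illustration, not the method of the paper.

```python
from statistics import mean, stdev


def fit_threshold(known_scores, k=2.0):
    """Assumed rule: mean minus k sample standard deviations of the
    top-class confidence scores observed on known-class validation data."""
    return mean(known_scores) - k * stdev(known_scores)


def predict_open_set(class_probs, threshold, unknown_label=-1):
    """Return the index of the most probable known class, or
    unknown_label when its probability falls below the threshold."""
    best = max(range(len(class_probs)), key=class_probs.__getitem__)
    return best if class_probs[best] >= threshold else unknown_label


# Hypothetical confidence scores from known-class validation samples.
validation_scores = [0.90, 0.95, 0.85, 0.92, 0.88]
threshold = fit_threshold(validation_scores)

confident = predict_open_set([0.05, 0.90, 0.05], threshold)
uncertain = predict_open_set([0.40, 0.35, 0.25], threshold)
```

A larger `k` lowers the threshold and rejects fewer samples as unknown, so in practice it would be tuned on held-out data.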