Machine learning (ML) has become an important paradigm for cyberthreat detection (CTD) in recent years. Substantial research effort has been invested in developing specialized algorithms for CTD tasks. From an operational perspective, however, the progress of ML-based CTD is hindered by the difficulty of obtaining the large sets of labelled data needed to train ML detectors. A potential solution to this problem is semisupervised learning (SsL), a family of methods that combine small labelled datasets with large amounts of unlabelled data. This paper aims to systematize existing work on SsL for CTD and, in particular, to understand the utility of unlabelled data in such systems. To this end, we analyze the cost of labelling in various CTD tasks and develop a formal cost model for SsL in this context. Building on this foundation, we formalize a set of requirements for evaluating SsL methods that elucidate the contribution of unlabelled data. We review the state of the art and observe that no previous work meets these requirements. To address this problem, we propose a framework for assessing the benefits of unlabelled data in SsL. We showcase an application of this framework by performing the first benchmark evaluation that highlights the tradeoffs of 9 existing SsL methods on 9 public datasets. Our findings verify that, in some cases, unlabelled data provides a small but statistically significant performance gain. This paper highlights that SsL in CTD has considerable room for improvement, which should stimulate future research in this field.