Gathering cyber threat intelligence from open sources is becoming increasingly important for achieving and maintaining a high level of security as systems grow larger and more complex. However, these open sources often suffer from information overload. It is therefore useful to apply machine learning models that condense the information to what is necessary. Yet previous studies and applications have shown that existing classifiers cannot extract specific information about emerging cybersecurity events because of their low generalization ability. We therefore propose a system that overcomes this problem by training a new classifier for each new incident. Since standard training methods would require a large amount of labelled data for this, we combine three low-data regime techniques (transfer learning, data augmentation, and few-shot learning) to train a high-quality classifier from very few labelled instances. We evaluated our approach on a novel dataset derived from the Microsoft Exchange Server data breach of 2021, which was labelled by three experts. Our findings reveal an increase in F1 score of more than 21 points compared to standard training methods and more than 18 points compared to a state-of-the-art few-shot learning method. Furthermore, a classifier trained with this method on only 32 instances is less than 5 F1 points worse than a classifier trained on 1800 instances.
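The comparisons above are reported in F1 points, that is, the F1 score on a 0 to 100 scale. As a minimal sketch of how such a score is obtained, the following assumes a binary relevance classifier (1 = relevant to the incident, 0 = not relevant); the function name `f1_points` and the label lists are hypothetical and are not taken from the evaluated dataset or the authors' code.

```python
def f1_points(y_true, y_pred, positive=1):
    """Return the F1 score of the positive class on a 0-100 scale."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive prediction: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall, scaled to points.
    return 100.0 * 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
print(round(f1_points(y_true, y_pred), 1))
```

Under this reading, a gain of "21 points" corresponds to a difference of 0.21 in the raw F1 score.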