The constantly increasing number of threats and the diversity of existing information sources pose challenges for Computer Emergency Response Teams (CERTs). To respond to new threats, CERTs need to gather information in a timely and comprehensive manner. However, the sheer volume of information and the number of sources can lead to information overload. This paper investigates how clustering methods can reduce information overload for CERTs. Conditions for a clustering framework of this kind were established and subsequently tested. For the evaluation, different types of evaluation metrics were introduced and selected in relation to these framework conditions. Furthermore, different vectorizations and distance measures were evaluated and interpreted in combination with the clustering methods. Two ground-truth datasets were used for the evaluation: one containing threat messages and one containing messages from different news categories. The work shows that K-means clustering combined with TF-IDF vectorization and cosine distance provides the best results in the domain of threat messages.
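The best-performing combination named above (TF-IDF vectorization, cosine distance, K-means) can be illustrated with a minimal, standard-library-only sketch. It is not the implementation evaluated in the paper. The function names, the smoothed IDF formula, the farthest-first initialisation, and the four toy messages are all assumptions made for illustration. Because standard K-means minimises Euclidean distance, the sketch uses the common spherical variant, in which vectors and centroids are scaled to unit length so that the dot product equals cosine similarity; whether the paper uses this exact variant is not stated.

```python
import math
from collections import Counter


def normalise(vec):
    """Scale a sparse vector (dict term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def tfidf_vectors(docs):
    """Return unit-length TF-IDF vectors for whitespace-tokenised documents."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(t for tokens in tokenised for t in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch, not taken from the paper).
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, c in counts.items()}
        vectors.append(normalise(vec))
    return vectors


def cosine_similarity(a, b):
    """Dot product of two sparse vectors; the cosine when both are unit length."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def spherical_kmeans(vectors, k, max_iter=20):
    """Cluster unit vectors by cosine similarity; returns one label per vector."""
    # Deterministic farthest-first initialisation, starting from the first vector.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(min(
            vectors,
            key=lambda v: max(cosine_similarity(v, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        # Assignment step: pick the centroid with the highest cosine similarity.
        new_labels = [max(range(k), key=lambda j: cosine_similarity(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: sum the member vectors, then rescale to unit length.
        for j in range(k):
            total = {}
            for v, label in zip(vectors, labels):
                if label == j:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # keep the old centroid if a cluster ends up empty
                centroids[j] = normalise(total)
    return labels


# Hypothetical toy messages, not taken from the paper's datasets.
docs = [
    "ransomware encrypts hospital files and demands ransom payment",
    "new ransomware campaign demands ransom after it encrypts files",
    "phishing email steals banking credentials from users",
    "phishing campaign uses fake email to steal credentials",
]
labels = spherical_kmeans(tfidf_vectors(docs), k=2)
print(labels)
```

In this toy run the two ransomware messages share one label and the two phishing messages share the other. A real pipeline would add proper tokenisation, stop-word handling, and multiple random restarts.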