Cyber Threat Intelligence (CTI) is information describing threat vectors, vulnerabilities, and attacks, and it is often used as training data for AI-based cyber defense systems such as Cybersecurity Knowledge Graphs (CKG). There is a strong need for community-accessible datasets to train existing AI-based cybersecurity pipelines to extract meaningful insights from CTI efficiently and accurately. We have created an initial unstructured CTI corpus from a variety of open sources, which we are using to train and test cybersecurity entity models with the spaCy framework and to explore self-learning methods for automatically recognizing cybersecurity entities. We also describe methods for linking cybersecurity domain entities to existing world knowledge in Wikidata. Our future work will survey and test spaCy NLP tools and create methods for continuous integration of new information extracted from text.