Large amounts of threat intelligence information about malware attacks are available in disparate, typically unstructured, formats. Knowledge graphs can capture this information and its context as RDF triples composed of entities and relations. Sparse or inaccurate threat information, however, leads to incomplete or erroneous triples. The named entity recognition (NER) and relation extraction (RE) models used to populate the knowledge graph cannot fully guarantee accurate information retrieval, which further exacerbates the problem. This paper proposes an end-to-end approach to generating a malware knowledge graph called MalKG, the first open-source automated knowledge graph for malware threat intelligence. The MalKG dataset, called MT40K, contains approximately 40,000 triples generated from 27,354 unique entities and 34 relations. We demonstrate the application of MalKG in predicting missing malware threat intelligence information in the knowledge graph. For ground truth, we manually curate a knowledge graph called MT3K, with 3,027 triples generated from 5,741 unique entities and 22 relations. For entity prediction with a state-of-the-art entity prediction model (TuckER), our approach achieves 80.4 on the hits@10 metric (the share of test triples for which the correct missing entity appears among the model's top 10 predictions) and 0.75 for the mean reciprocal rank (MRR). We also propose a framework for extracting thousands of entities and relations into RDF triples, both manually and automatically, at the sentence level from 1,100 malware threat intelligence reports and from the Common Vulnerabilities and Exposures (CVE) database.
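To make the two reported metrics concrete, the following is a minimal sketch of how hits@k and MRR are commonly computed from the rank a model assigns to the correct entity in each test triple. The function names and the example ranks are illustrative assumptions, not taken from the MalKG code or data.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all test triples (ranks are 1-based)."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Fraction of test triples whose correct entity ranks within the top k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct entity for five test triples.
example_ranks = [1, 2, 4, 10, 25]

mrr = mean_reciprocal_rank(example_ranks)
hits10 = hits_at_k(example_ranks, k=10)
print(f"MRR = {mrr:.3f}, hits@10 = {hits10:.2f}")
```

In this toy example four of the five correct entities fall within the top 10, so hits@10 is 0.8. A hits@10 value reported on a 0 to 100 scale, such as the 80.4 above, would correspond to this fraction multiplied by 100.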