A Deep Learning Approach for Ontology Enrichment from Unstructured Text

From arXiv. Accepted as a book chapter in "Cybersecurity & High-Performance Computing Environments: Integrated Innovations, Practices, and Applications", published by Taylor and Francis. arXiv admin note: substantial text overlap with arXiv:2102.04081

Information security in the cyber world is a major cause for concern, with a significant increase in the number of attack surfaces. Existing information on vulnerabilities, attacks, controls, and advisories available on the web provides an opportunity to represent knowledge and perform security analytics to mitigate some of these concerns. Representing security knowledge as an ontology facilitates anomaly detection, threat intelligence, and reasoning about and relevance attribution of attacks, among other tasks. This necessitates dynamic and automated enrichment of information security ontologies. However, existing ontology enrichment algorithms based on natural language processing and ML models struggle with contextual extraction of concepts from words, phrases, and sentences. This motivates the need for sequential Deep Learning architectures that traverse dependency paths in text and extract embedded vulnerabilities, threats, controls, products, and other security-related concepts and instances from learned path representations. In the proposed approach, Bidirectional LSTMs trained on a large DBpedia dataset and Wikipedia corpus of 2.8 GB, along with the Universal Sentence Encoder, are deployed to enrich an ISO 27001-based information security ontology. The model is trained and tested in a high-performance computing (HPC) environment to handle the dimensionality of the Wiki text. The approach yielded a test accuracy of over 80% when tested with knocked-out concepts from the ontology and web page instances to validate its robustness.
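
To make the idea of traversing dependency paths concrete, the following is a minimal sketch of extracting the path between two terms in a dependency parse. It is an illustration under stated assumptions, not the authors' implementation: the toy sentence, the hand-written edge list `EDGES`, and the helper names `build_graph` and `dependency_path` are all hypothetical, and a real pipeline would obtain the edges from a dependency parser and feed the resulting path to a sequence model such as a Bidirectional LSTM.

```python
from collections import deque

# Hand-written (head, dependent, relation) edges for the toy sentence
# "SQL injection exploits vulnerabilities in web applications".
# ASSUMPTION: hard-coded for illustration; a parser would produce these.
EDGES = [
    ("exploits", "injection", "nsubj"),
    ("injection", "SQL", "compound"),
    ("exploits", "vulnerabilities", "obj"),
    ("vulnerabilities", "applications", "nmod"),
    ("applications", "in", "case"),
    ("applications", "web", "compound"),
]


def build_graph(edges):
    """Build an undirected adjacency map that remembers edge direction."""
    graph = {}
    for head, dep, rel in edges:
        graph.setdefault(head, []).append((dep, rel, "down"))  # head -> dependent
        graph.setdefault(dep, []).append((head, rel, "up"))    # dependent -> head
    return graph


def dependency_path(edges, source, target):
    """Return the shortest path from source to target, or None if unreachable.

    Each step is (token, relation, direction); the first step is the source
    token itself with no relation or direction.
    """
    graph = build_graph(edges)
    parent = {source: None}
    queue = deque([source])
    while queue:  # breadth-first search over the parse graph
        node = queue.popleft()
        if node == target:
            break
        for nxt, rel, direction in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = (node, rel, direction)
                queue.append(nxt)
    if target not in parent:
        return None
    steps = []
    node = target
    while parent[node] is not None:  # walk back from target to source
        prev, rel, direction = parent[node]
        steps.append((node, rel, direction))
        node = prev
    steps.reverse()
    return [(source, None, None)] + steps


if __name__ == "__main__":
    for step in dependency_path(EDGES, "injection", "applications"):
        print(step)
```

In this sketch the path between a candidate term ("injection") and a candidate concept ("applications") is a short sequence of tokens and relations, which is the kind of sequential input a path-based model could learn a representation from.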
