A Language Model for Text Analytics in Cybersecurity

Natural language processing (NLP) is a branch of artificial intelligence and machine learning concerned with a machine's ability to understand and interpret human language. Language models are crucial in text analytics and NLP because they allow computers to interpret qualitative input and convert it into quantitative data that can be used in other tasks. In transfer learning, a language model is typically trained on a large generic corpus (the pre-training stage) and then fine-tuned for a specific task. As a result, pre-trained language models are mostly used as baseline models that incorporate a broad grasp of context and can be further customized for a new NLP task. The majority of pre-trained models are trained on corpora from general domains, such as Twitter, newswire, Wikipedia, and the Web. Such off-the-shelf NLP models trained on general text may be inefficient and inaccurate in specialized fields. In this paper, we propose a cybersecurity language model called SecureBERT, which captures text connotations in the cybersecurity domain and can therefore be used to automate many important cybersecurity tasks that would otherwise rely on human expertise and tedious manual effort. SecureBERT is trained on a large corpus of cybersecurity text that we collected and preprocessed from a variety of sources in cybersecurity and the general computing domain. Using our proposed methods for tokenization and model weight adjustment, SecureBERT not only preserves the understanding of general English that most pre-trained language models have, but is also effective when applied to text with cybersecurity implications.
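To illustrate why a general-domain vocabulary can handle specialized text poorly, the following minimal sketch splits a security term with a greedy longest-match subword tokenizer (WordPiece-style). It is an illustration under stated assumptions, not the tokenization method proposed in the paper: the function `greedy_subword_tokenize` and both toy vocabularies are hypothetical and were chosen only to make the effect visible.

```python
from typing import List, Set


def greedy_subword_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into subwords by greedy longest-match-first lookup.

    Pieces that do not start the word carry a '##' prefix, following the
    WordPiece convention. If any position cannot be matched, the whole
    word maps to the unknown token.
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies (assumptions for illustration only).
GENERAL_VOCAB = {"ran", "##som", "##ware", "mal", "soft"}
DOMAIN_VOCAB = GENERAL_VOCAB | {"ransomware", "malware"}

general_pieces = greedy_subword_tokenize("ransomware", GENERAL_VOCAB)
domain_pieces = greedy_subword_tokenize("ransomware", DOMAIN_VOCAB)
print(general_pieces)
print(domain_pieces)
```

With the general toy vocabulary the term is broken into several fragments, while the domain-extended vocabulary keeps it as a single token. This is one plausible reason a model pre-trained only on general text may represent cybersecurity terminology less precisely.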
