Natural Language Processing (NLP) has recently gained wide attention in cybersecurity, particularly in Cyber Threat Intelligence (CTI) and cyber automation. Increased connectivity and automation have revolutionized the world's economic and cultural infrastructures, while also introducing new risks of cyber attack. CTI is information that helps cybersecurity analysts make intelligent security decisions. It is often delivered as natural language text and must be converted into a machine-readable format by an automated procedure before it can be used in automated security measures. This paper proposes SecureBERT, a cybersecurity language model that captures the connotations of cybersecurity text (e.g., CTI) and can therefore automate many critical cybersecurity tasks that would otherwise rely on human expertise and time-consuming manual effort. SecureBERT has been trained on a large corpus of cybersecurity text. To make SecureBERT effective not only at retaining general English understanding but also when applied to text with cybersecurity implications, we developed a customized tokenizer as well as a method to alter pre-trained weights. SecureBERT is evaluated using the standard Masked Language Model (MLM) test as well as two additional standard NLP tasks. Our evaluation studies show that SecureBERT\footnote{\url{https://github.com/ehsanaghaei/SecureBERT}} outperforms existing similar models, confirming its capability for solving crucial NLP tasks in cybersecurity.
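To make the MLM evaluation concrete, the sketch below probes the released model with a masked cybersecurity sentence. It assumes the checkpoint is hosted on the Hugging Face Hub under the ID \texttt{ehsanaghaei/SecureBERT} (consistent with the linked repository) and that SecureBERT inherits RoBERTa's \texttt{<mask>} convention; the probe sentence is our own illustration, not an item from the paper's test set.

\begin{verbatim}
# Minimal fill-mask probe of SecureBERT (a sketch; the Hub model ID is
# assumed based on the linked GitHub repository, not stated in the abstract).
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="ehsanaghaei/SecureBERT",       # assumed Hugging Face Hub ID
    tokenizer="ehsanaghaei/SecureBERT",
)

# SecureBERT builds on RoBERTa, so the mask token is "<mask>".
sentence = "The attacker exploited a <mask> vulnerability to gain access."
for pred in fill_mask(sentence):
    print(f"{pred['token_str']:>15}  score={pred['score']:.3f}")
\end{verbatim}

A model pre-trained on cybersecurity text should rank domain terms (e.g., ``remote'' or ``zero-day'') above generic English completions, which is what the MLM test measures.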