Extracting cybersecurity entities such as attackers and vulnerabilities from unstructured network texts is an important part of security analysis. However, intelligence data is sparse because cybersecurity entity names vary frequently and are often random, which makes it difficult for current methods to extract security-related concepts and entities well. To this end, we propose a semantic augmentation method that incorporates different linguistic features to enrich the representation of input tokens, in order to detect and classify cybersecurity entity names in unstructured text. In particular, we encode and aggregate the constituent, morphological, and part-of-speech features of each input token to improve the robustness of the method. In addition, each token receives augmented semantic information from two sources: its K most similar words in a cybersecurity domain corpus, where an attention module weighs the differences among these words, and contextual clues drawn from a large-scale general-domain corpus. We conducted experiments on the cybersecurity datasets DNRTI and MalwareTextDB, and the results demonstrate the effectiveness of the proposed method.
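The augmentation step that draws on the K most similar domain words can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the form of the attention module, or how the augmented vector is combined with the token representation, so everything below is an assumption made for illustration: cosine similarity selects the neighbours, a softmax over dot-product scores stands in for the attention module, and the pooled neighbour vector is concatenated to the token vector. The function names (`top_k_similar`, `augment`) and the toy embeddings are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0.0 else 0.0


def top_k_similar(
    query: Vector, vocab: Dict[str, Vector], k: int
) -> List[Tuple[str, Vector]]:
    """Return the k vocabulary entries most similar to the query (assumed: cosine)."""
    ranked = sorted(
        vocab.items(), key=lambda item: cosine(query, item[1]), reverse=True
    )
    return ranked[:k]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def augment(
    query: Vector, vocab: Dict[str, Vector], k: int
) -> Tuple[Vector, List[float]]:
    """Concatenate the token vector with an attention-weighted neighbour average.

    Assumption: dot-product scores followed by softmax play the role of the
    attention module; the source does not describe its exact form.
    """
    neighbours = top_k_similar(query, vocab, k)
    if not neighbours:
        # No domain words available: pad with zeros so the output size is stable.
        return query + [0.0] * len(query), []
    weights = softmax([dot(query, vec) for _, vec in neighbours])
    pooled = [0.0] * len(query)
    for weight, (_, vec) in zip(weights, neighbours):
        for i, value in enumerate(vec):
            pooled[i] += weight * value
    return query + pooled, weights
```

With a toy domain vocabulary such as `{"trojan": [1.0, 0.0], "malware": [0.9, 0.1], "banana": [0.0, 1.0]}` and a query vector `[1.0, 0.0]`, calling `augment(query, vocab, 2)` selects the two security-related neighbours and gives the closer one the larger attention weight. In a real system the embeddings would come from a model trained on the domain corpus, and the attention parameters would be learned rather than fixed.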