With the increase in cybersecurity vulnerabilities in software systems, the ways to exploit them are also increasing. Malware threats, irregular network interactions, and discussions of exploits in public forums are likewise on the rise. Automated approaches are necessary to identify these threats faster, to detect potentially relevant entities in arbitrary texts, and to stay aware of software vulnerabilities. Applying natural language processing (NLP) techniques to the cybersecurity domain can help achieve this. However, there are challenges: the diverse nature of texts in the cybersecurity domain, the unavailability of large-scale publicly available datasets, and the significant cost of hiring subject matter experts for annotation. One solution is to build multi-task models that can be trained jointly with limited data. In this work, we introduce a generative multi-task model, Unified Text-to-Text Cybersecurity (UTS), trained on malware reports, phishing site URLs, programming code constructs, social media data, blogs, news articles, and public forum posts. We show that UTS improves performance on several cybersecurity datasets. We also show that, with only a few examples, UTS can be adapted to novel unseen tasks and to data of a different nature.
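The abstract describes UTS as a unified text-to-text multi-task model, but does not spell out how individual cybersecurity tasks are cast into that format. The sketch below illustrates one common way to do this with a T5-style encoder-decoder from the HuggingFace transformers library: each task is marked with a textual prefix and both inputs and targets are plain strings. The model name, task prefixes, and example strings here are illustrative assumptions, not the paper's actual configuration.

```python
# A minimal sketch of text-to-text multi-task framing, assuming a T5-style
# backbone from HuggingFace transformers. Task prefixes ("ner:", "classify:")
# and example texts are hypothetical placeholders, not the UTS training data.
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Two different cybersecurity tasks expressed as (input text, target text) pairs.
examples = [
    ("ner: The Emotet trojan spread via phishing emails.", "Emotet : malware"),
    ("classify: http://secure-login.example-bank.co/verify", "phishing"),
]

model.train()
for source, target in examples:
    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    # The decoder is trained to emit the target string; the same model and
    # loss are shared across tasks, which is what enables joint training.
    outputs = model(input_ids=inputs.input_ids,
                    attention_mask=inputs.attention_mask,
                    labels=labels)
    outputs.loss.backward()

# Inference reuses the same interface: generate text conditioned on the prefix.
model.eval()
with torch.no_grad():
    query = tokenizer("classify: http://update-account.example.info", return_tensors="pt")
    prediction = model.generate(query.input_ids, max_length=16)
print(tokenizer.decode(prediction[0], skip_special_tokens=True))
```

Because every task shares the same string-in, string-out interface, adapting to a new task in a few-shot setting amounts to fine-tuning on a handful of additional (prefix + input, target) pairs rather than adding task-specific heads.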