Public security vulnerability reports (e.g., CVE reports) play an important role in the maintenance of computer and network systems. Security companies and administrators rely on information from these reports to prioritize the development and deployment of patches for their customers. Since these reports are unstructured text, automatic information extraction (IE) can help scale up processing by converting them into structured forms, e.g., software names and versions, and vulnerability types. Existing work on automated IE for security vulnerability reports often relies on a large number of labeled training samples. However, creating a massive labeled training set is both expensive and time-consuming. In this work, for the first time, we investigate this problem in the setting where only a small number of labeled training samples are available. In particular, we investigate the performance of fine-tuning several state-of-the-art pre-trained language models on our small training dataset. The results show that, with pre-trained language models and carefully tuned hyperparameters, we match or slightly outperform the state-of-the-art system on this task. Following the same two-step process as in [7], first fine-tuning on the main category and then transfer learning to the other categories, our proposed approach substantially decreases the number of required labeled samples in both stages: a 90% reduction in fine-tuning (from 5758 to 576), and an 88.8% reduction in transfer learning, with 64 labeled samples per category. Our experiments thus demonstrate the effectiveness of few-sample learning on named entity recognition (NER) for security vulnerability reports. This result opens up multiple research opportunities for few-sample learning on security vulnerability reports, which are discussed in the paper. Code: https://github.com/guanqun-yang/FewVulnerability.
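As a quick check on the fine-tuning figure quoted above, the following minimal sketch draws a 10% random subsample from a set of 5758 items and computes the resulting reduction. The helper names `reduction` and `subsample` are hypothetical, and uniform random subsampling is an assumption made for illustration; the abstract does not state how the 576 samples were selected.

```python
import random


def reduction(full: int, reduced: int) -> float:
    """Fractional reduction in the number of labeled samples."""
    return 1 - reduced / full


def subsample(samples, fraction, seed=0):
    """Randomly keep about `fraction` of the samples (at least one).

    Uniform sampling is an assumption, not the paper's stated procedure.
    """
    k = max(1, round(len(samples) * fraction))
    return random.Random(seed).sample(samples, k)


full_size = 5758                        # full fine-tuning set size from the abstract
placeholder = list(range(full_size))    # stand-in for labeled reports
small = subsample(placeholder, 0.10)

print(len(small))                                 # 576
print(f"{reduction(full_size, len(small)):.1%}")  # 90.0%
```

The 88.8% transfer-learning figure cannot be reproduced the same way, because the abstract gives only the reduced count (64 per category) and not the original per-category count.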