Data Quality for Software Vulnerability Datasets

The use of learning-based techniques to achieve automated software vulnerability detection has been of longstanding interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is growing, there has been little investigation into the potential negative impacts of software vulnerability data quality. For instance, we lack confirmation that vulnerability labels are correct or consistent. Our study seeks to address such shortcomings by inspecting five inherent data quality attributes for four state-of-the-art software vulnerability datasets and the subsequent impacts that issues can have on software vulnerability prediction models. Surprisingly, we found that all the analyzed datasets exhibit some data quality problems. In particular, we found 20-71% of vulnerability labels to be inaccurate in real-world datasets, and 17-99% of data points were duplicated. We observed that these issues could cause significant impacts on downstream models, either preventing effective model training or inflating benchmark performance. We advocate for the need to overcome such challenges. Our findings will enable better consideration and assessment of software vulnerability data quality in the future.
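Two of the issues named above, duplicated data points and inconsistent labels, can be checked mechanically. The following is a minimal sketch under stated assumptions, not the study's own procedure: it assumes each sample is a `(code, label)` pair, treats two samples as duplicates only when their code is identical after whitespace normalization, and uses the hypothetical helper names `normalize` and `quality_report`. Near-duplicate detection and label accuracy checks would need more than this.

```python
import hashlib
from collections import defaultdict


def normalize(code: str) -> str:
    # Assumption: collapsing whitespace is enough to ignore formatting-only differences.
    return " ".join(code.split())


def quality_report(samples):
    """Summarize exact duplicates and conflicting labels.

    samples: iterable of (code, label) pairs, e.g. label 1 = vulnerable, 0 = not.
    """
    labels_by_hash = defaultdict(list)
    for code, label in samples:
        digest = hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()
        labels_by_hash[digest].append(label)

    total = sum(len(labels) for labels in labels_by_hash.values())
    # Every copy beyond the first in a group counts as a duplicated data point.
    duplicated = sum(len(labels) - 1 for labels in labels_by_hash.values())
    # A group is conflicting when identical code carries more than one label.
    conflicting = sum(1 for labels in labels_by_hash.values() if len(set(labels)) > 1)

    return {
        "total": total,
        "duplicate_rate": duplicated / total if total else 0.0,
        "conflicting_groups": conflicting,
    }


if __name__ == "__main__":
    demo = [
        ("int a = 1;", 0),
        ("int  a =\n1;", 0),   # same code, different formatting
        ("int a = 1;", 1),     # same code, different label
        ("return x;", 0),
    ]
    print(quality_report(demo))
```

Removing such duplicates before splitting into training and test sets avoids the benchmark inflation the abstract describes, since identical samples can otherwise appear on both sides of the split.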

