The use of learning-based techniques to achieve automated software vulnerability detection has been of longstanding interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is growing, there has been little investigation into the potential negative impacts of software vulnerability data quality. For instance, we lack confirmation that vulnerability labels are correct or consistent. Our study seeks to address such shortcomings by inspecting five inherent data quality attributes for four state-of-the-art software vulnerability datasets and the subsequent impacts that issues can have on software vulnerability prediction models. Surprisingly, we found that all the analyzed datasets exhibit some data quality problems. In particular, we found 20-71% of vulnerability labels to be inaccurate in real-world datasets, and 17-99% of data points were duplicated. We observed that these issues could cause significant impacts on downstream models, either preventing effective model training or inflating benchmark performance. We advocate for the need to overcome such challenges. Our findings will enable better consideration and assessment of software vulnerability data quality in the future.
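The duplication issue mentioned above can be checked mechanically. Below is a minimal sketch, assuming the datasets store each data point as function source text; it flags exact duplicates after stripping comments and whitespace, which is one simple way such a data quality attribute could be measured (the paper's actual methodology may differ):

```python
import hashlib
import re

def normalize(code: str) -> str:
    """Strip C-style comments and all whitespace so cosmetically
    different copies of the same function hash identically."""
    code = re.sub(r"//[^\n]*|/\*.*?\*/", "", code, flags=re.DOTALL)
    return re.sub(r"\s+", "", code)

def find_duplicates(samples):
    """Return indices of samples whose normalized form was already seen."""
    seen, dupes = {}, []
    for i, code in enumerate(samples):
        h = hashlib.sha256(normalize(code).encode()).hexdigest()
        if h in seen:
            dupes.append(i)
        else:
            seen[h] = i
    return dupes

# Hypothetical toy samples: the second is the first, reformatted.
samples = [
    "int add(int a, int b) { return a + b; }",
    "int add(int a,int b){return a+b;}  // same function, reformatted",
    "void free_buf(char *p) { free(p); }",
]
print(find_duplicates(samples))  # → [1]
```

A real audit would also need near-duplicate detection (e.g. token-level similarity), since trivial edits such as renamed variables defeat exact hashing.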