Deep Neural Networks (DNNs) have been extensively used in many areas, including image processing, medical diagnostics, and autonomous driving. However, DNNs can exhibit erroneous behaviours that may lead to critical errors, especially when used in safety-critical systems. Inspired by testing techniques for traditional software systems, researchers have proposed neuron coverage criteria, as an analogy to source code coverage, to guide the testing of DNN models. Despite very active research on DNN coverage, several recent studies have questioned the usefulness of such criteria in guiding DNN testing. Further, from a practical standpoint, these criteria are white-box as they require access to the internals or training data of DNN models, which in many contexts is neither feasible nor convenient. In this paper, we investigate black-box input diversity metrics as an alternative to white-box coverage criteria. To this end, we first select and adapt three diversity metrics and study, in a controlled manner, their capacity to measure actual diversity in input sets. We then analyse their statistical association with fault detection using two datasets and three DNN models. We further compare diversity with state-of-the-art white-box coverage criteria. Our experiments show that relying on the diversity of image features embedded in test input sets is a more reliable indicator than coverage criteria to effectively guide the testing of DNNs. Indeed, we found that one of our selected black-box diversity metrics far outperforms existing coverage criteria in terms of fault-revealing capability and computational time. Results also confirm suspicions that state-of-the-art coverage criteria are not adequate to guide the construction of test input sets to detect as many faults as possible with natural inputs.