Testing Deep Learning (DL)-based systems inherently requires large and representative test sets to evaluate whether DL systems generalise beyond their training datasets. Diverse Test Input Generators (TIGs) have been proposed to produce artificial inputs that expose issues of DL systems by triggering misbehaviours. Unfortunately, such generated inputs may be invalid, i.e., not recognisable as part of the input domain, thus providing an unreliable quality assessment. Automated validators can ease the burden of manually checking input validity for human testers, although input validity is a concept that is difficult to formalise and, thus, to automate. In this paper, we investigate to what extent TIGs can generate valid inputs, according to both automated and human validators. We conduct a large empirical study involving 2 different automated validators, 220 human assessors, 5 different TIGs, and 3 classification tasks. Our results show that 84% of artificially generated inputs are valid, according to automated validators, but their expected label is not always preserved. Automated validators reach a good consensus with humans (78% accuracy), but still have limitations when dealing with feature-rich datasets.