Context: Mutation Testing (MT) is an important tool in traditional Software Engineering (SE) white-box testing. It artificially injects faults into a system to evaluate a test suite's ability to detect them, on the assumption that this fault-detection ability carries over to real faults. Although MT has long been used in SE, it has only recently gained the attention of the Deep Learning (DL) community, where researchers have adapted it to improve the testability of DL models and the trustworthiness of DL systems. Objective: Although several MT techniques have been proposed for DL, most of them neglect the stochasticity inherent to DL that results from the training phase. Even the latest MT approaches for DL, which tackle MT through a statistical approach, may give inconsistent results. Because their statistic is computed on a fixed set of sampled training instances, the outcome can differ from one set of instances to another, when it should be consistent regardless of which instances are sampled. Methods: In this work, we propose Probabilistic Mutation Testing (PMT), an approach that alleviates the inconsistency problem and allows a more consistent decision on whether a mutant is killed. Results: Through an evaluation using three models and eight mutation operators from previously proposed MT methods, we show that PMT allows a more consistent and informed decision on mutations. We also analyze the trade-off between the approximation error and the cost of our method, showing that a relatively small error can be achieved at a manageable cost. Conclusion: Our results show the limitations of current MT practices for Deep Neural Networks (DNNs) and the need to rethink them. We believe PMT is the first step in that direction, effectively removing the lack of consistency across test executions that previous methods exhibit because of the stochasticity of DNN training.
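The inconsistency described in the Objective can be illustrated with a small simulation. The sketch below is not the PMT algorithm and does not reproduce any existing MT tool: it assumes, purely for illustration, that the accuracies of trained instances are Gaussian, and it uses a hypothetical kill rule based only on an effect-size threshold (`min_effect`). The function names and all numeric values are assumptions.

```python
import random
import statistics


def cohens_d(a, b):
    """Effect size between two equal-sized samples (pooled standard deviation)."""
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5


def is_killed(orig_accs, mutant_accs, min_effect=0.5):
    """Hypothetical kill rule: the accuracy shift must reach a minimum effect size."""
    return abs(cohens_d(orig_accs, mutant_accs)) >= min_effect


def kill_rate(n_instances, n_repeats, seed=0):
    """Fraction of independent instance sets on which the same mutant is killed.

    Each repeat stands for training a fresh set of n_instances original models
    and n_instances mutant models. Accuracies are simulated (assumed Gaussian),
    not measured on real networks.
    """
    rng = random.Random(seed)
    kills = 0
    for _ in range(n_repeats):
        orig = [rng.gauss(0.900, 0.010) for _ in range(n_instances)]
        mutant = [rng.gauss(0.895, 0.010) for _ in range(n_instances)]
        kills += is_killed(orig, mutant)
    return kills / n_repeats


if __name__ == "__main__":
    # A rate well away from both 0 and 1 means the verdict depends on
    # which set of instances happened to be sampled.
    print(kill_rate(n_instances=20, n_repeats=500))
```

Under these assumptions the same mutant is killed on some instance sets and survives on others, which is the behaviour a fixed-sample statistic cannot rule out. Reporting the kill decision as a probability rather than a single verdict is one way to make that dependence visible.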