In this work, we studied how to build an automated testing system for deep learning systems based on differential behavior criteria. The testing goals were achieved by jointly optimizing two objective functions: maximizing the differential behavior of the models under test and maximizing neuron coverage. The behavior of three pre-trained models was compared at each testing iteration, and any input image that triggered erroneous behavior, i.e., on which the models disagreed, was recorded as a corner case. The generated corner cases can be used to examine the robustness of deep neural networks (DNNs) and, in turn, to improve model accuracy. An existing project, DeepXplore, served as the baseline system. After fully implementing and optimizing the baseline, we explored using its newly generated corner cases to augment the training dataset. On the GTSRB dataset, retraining with the automatically generated corner cases increased the accuracy of three generic models by 259.2%, 53.6%, and 58.3%, respectively. To further extend automated testing, we explored other approaches based on differential behavior criteria for generating photo-realistic test images for deep learning systems. The first approach applies various transformations to the seed images fed to the deep learning framework. The second uses Generative Adversarial Networks (GANs) and was implemented on the MNIST and Driving datasets. Its style-transfer capability proved very effective at adding visual effects, replacing image elements, and shifting style (from virtual to real images). These results suggest that GAN-based test-sample generation is the next frontier of automated testing for deep learning systems.
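The joint optimization described above can be illustrated with a small, dependency-free sketch. It assumes a DeepXplore-style objective: for an input x that all models currently label as class c, one model j is pushed away from c while the others are kept on it, and a coverage term rewards activating a neuron that no tested input has activated yet, giving obj(x) = Σ_{i≠j} F_i(x)[c] − λ1·F_j(x)[c] + λ2·f_n(x). The search ascends this objective and stops as soon as the models disagree, which is the point at which an input would be recorded as a corner case. Everything concrete here is hypothetical: the ToyModel class, the two-feature inputs in [0, 1], the weights, λ1 = 1, λ2 = 0.1, and the use of central finite differences in place of backpropagation. A real image pipeline would also need domain constraints on each update.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vec = List[float]


class ToyModel:
    """Hypothetical binary classifier: one ReLU hidden layer + sigmoid output."""

    def __init__(self, w_hidden: List[Vec], w_out: Vec, b_out: float):
        self.w_hidden, self.w_out, self.b_out = w_hidden, w_out, b_out

    def hidden(self, x: Sequence[float]) -> Vec:
        # ReLU activations of the hidden neurons (the units that neuron coverage counts).
        return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in self.w_hidden]

    def prob(self, x: Sequence[float]) -> float:
        # Probability of class 1.
        z = sum(w * h for w, h in zip(self.w_out, self.hidden(x))) + self.b_out
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x: Sequence[float]) -> int:
        return int(self.prob(x) >= 0.5)


def neuron_coverage(models, inputs, t: float = 0.0) -> float:
    """Fraction of hidden neurons (over all models) activated above t by some input."""
    total = sum(len(m.w_hidden) for m in models)
    covered = {(mi, ni) for mi, m in enumerate(models) for x in inputs
               for ni, a in enumerate(m.hidden(x)) if a > t}
    return len(covered) / total


def first_uncovered(models, inputs, t: float = 0.0) -> Optional[Tuple[int, int]]:
    # (model index, neuron index) of the first neuron no input has activated, else None.
    for mi, m in enumerate(models):
        for ni in range(len(m.w_hidden)):
            if all(m.hidden(x)[ni] <= t for x in inputs):
                return mi, ni
    return None


def joint_objective(x, models, c, j, target, lam1=1.0, lam2=0.1) -> float:
    # Differential term: keep models i != j on class c, push model j away from it.
    p = [m.prob(x) if c == 1 else 1.0 - m.prob(x) for m in models]
    diff = sum(pi for i, pi in enumerate(p) if i != j) - lam1 * p[j]
    # Coverage term: activation of a not-yet-covered neuron (0 if everything is covered).
    cov = models[target[0]].hidden(x)[target[1]] if target else 0.0
    return diff + lam2 * cov


def numeric_grad(f: Callable[[Vec], float], x: Vec, eps: float = 1e-5) -> Vec:
    # Central finite differences, standing in for backpropagation.
    g = []
    for k in range(len(x)):
        hi, lo = list(x), list(x)
        hi[k] += eps
        lo[k] -= eps
        g.append((f(hi) - f(lo)) / (2 * eps))
    return g


def find_corner_case(seed, models, j=0, step=0.1, max_iters=100) -> Optional[Vec]:
    c = models[0].predict(seed)  # assumes all models agree on the seed
    x, tested = list(seed), [list(seed)]
    for _ in range(max_iters):
        if len({m.predict(x) for m in models}) > 1:
            return x  # differential behavior: register x as a corner case
        target = first_uncovered(models, tested)
        g = numeric_grad(lambda z: joint_objective(z, models, c, j, target), x)
        x = [min(1.0, max(0.0, xi + step * gi)) for xi, gi in zip(x, g)]  # stay in [0, 1]
        tested.append(x)
    return None


# Three hypothetical models that rely on different input features.
models = [
    ToyModel([[1.0, 0.0]], [6.0], -3.0),                   # class 1 iff x0 >= 0.5
    ToyModel([[0.0, 1.0]], [6.0], -3.0),                   # class 1 iff x1 >= 0.5
    ToyModel([[1.0, 0.0], [0.0, 1.0]], [3.0, 3.0], -3.0),  # class 1 iff x0 + x1 >= 1
]
corner = find_corner_case([0.8, 0.8], models)
print(corner, [m.predict(corner) for m in models])
```

Because the three toy models key on different features, the ascent lowers x0 (driving model 0 toward class 0) while raising x1 (keeping the other two on class 1), so the returned point is one where model 0 predicts 0 and models 1 and 2 still predict 1.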
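The transformation-based approach can be sketched in the same spirit: apply a fixed set of image transformations to a seed image and keep every variant on which the models under test disagree. The 2x4 grayscale seed, the three threshold "models", and the particular brightness, contrast, and shift transformations below are illustrative assumptions, not the transformations or networks used in this work.

```python
from typing import Callable, Dict, List

Image = List[List[float]]


def clip(v: float) -> float:
    return min(1.0, max(0.0, v))


def mean(img: Image) -> float:
    return sum(map(sum, img)) / (len(img) * len(img[0]))


def brightness(img: Image, delta: float) -> Image:
    return [[clip(p + delta) for p in row] for row in img]


def contrast(img: Image, factor: float) -> Image:
    m = mean(img)  # scale each pixel's distance from the image mean
    return [[clip(m + factor * (p - m)) for p in row] for row in img]


def translate(img: Image, dx: int) -> Image:
    # Shift columns right by dx pixels, filling the vacated columns with black (0.0).
    w = len(img[0])
    return [[row[c - dx] if 0 <= c - dx < w else 0.0 for c in range(w)] for row in img]


# Hypothetical stand-ins for three DNNs under test: each maps an image to a label.
Model = Callable[[Image], int]
MODELS: List[Model] = [
    lambda img: int(mean(img) >= 0.5),                                    # global intensity
    lambda img: int(mean([row[: len(row) // 2] for row in img]) >= 0.5),  # left half only
    lambda img: int(max(map(max, img)) >= 0.55),                          # brightest pixel
]


def transformation_corner_cases(seed: Image, transforms: Dict[str, Callable[[Image], Image]],
                                models: List[Model]) -> List[str]:
    """Return the names of transformations whose output makes the models disagree."""
    found = []
    for name, t in transforms.items():
        candidate = t(seed)
        if len({m(candidate) for m in models}) > 1:  # differential behavior
            found.append(name)
    return found


seed = [[0.6] * 4 for _ in range(2)]  # 2x4 gray image on which all three models predict 1
transforms = {
    "brighter(+0.2)": lambda img: brightness(img, 0.2),
    "darker(-0.2)": lambda img: brightness(img, -0.2),
    "contrast(x2)": lambda img: contrast(img, 2.0),
    "shift(+1)": lambda img: translate(img, 1),
    "shift(+2)": lambda img: translate(img, 2),
}
print(transformation_corner_cases(seed, transforms, MODELS))
```

Uniform brightness and contrast changes move all three toy models together, whereas shifting the image darkens the left half and lowers the mean without changing the brightest pixel, so only the shifted variants are flagged as corner cases in this sketch.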