Game-like programs have become increasingly popular in many software engineering domains such as mobile apps, web applications, and programming education. However, creating tests for programs whose purpose is to challenge human players is a daunting task for automatic test generators. Even if test generation succeeds in finding a relevant sequence of events to exercise a program, the randomized nature of games means that it may be possible neither to reproduce the exact program behavior underlying this sequence, nor to create test assertions that check whether the observed randomized game behavior is correct. To overcome these problems, we propose Neatest, a novel test generator based on the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. Neatest systematically explores a program's statements and creates neural networks that operate the program in order to reliably reach each statement; that is, Neatest learns to play the game in a way that reliably covers different parts of the code. As the networks learn the actual game behavior, they can also serve as test oracles by evaluating how surprising the observed behavior of a program under test is compared to a supposedly correct version of the program. We evaluate this approach in the context of Scratch, an educational programming environment. Our empirical study on 25 non-trivial Scratch games demonstrates that our approach can successfully train neural networks that are not only far more resilient to random influences than traditional test suites consisting of static input sequences, but are also highly effective, with an average mutation score of more than 65%.
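To make the idea of a network acting as a test oracle more concrete, the following minimal Python sketch shows one way a surprise measure could separate a faulty program from a correct but randomized one. Everything in it is an assumption made for illustration: the toy game (an object falling at a jittered speed), the single hand-weighted `hidden_activation` node standing in for a trained network, the z-score used as the surprise measure, and the `THRESHOLD` value. It is not the metric or the implementation used by Neatest, and no NEAT training takes place.

```python
import math
import random
from statistics import mean, stdev
from typing import Sequence


def hidden_activation(fall_speed: float) -> float:
    """Stand-in for one hidden node of a trained network (weights are made up)."""
    return math.tanh(0.3 * fall_speed - 1.5)


def play(fall_speed_mean: float, seed: int, steps: int = 50) -> float:
    """Play one randomized toy game and return the mean activation observed."""
    rng = random.Random(seed)
    activations = []
    for _ in range(steps):
        # Randomized game behavior: the speed jitters around its mean each step.
        speed = fall_speed_mean + rng.uniform(-1.0, 1.0)
        activations.append(hidden_activation(speed))
    return mean(activations)


def surprise(observed: float, reference: Sequence[float]) -> float:
    """Distance of an observation from the reference runs, in standard deviations."""
    return abs(observed - mean(reference)) / stdev(reference)


# Activations recorded on the supposedly correct program (mean speed 5.0),
# collected over many seeds so that legitimate randomness is part of the baseline.
reference_runs = [play(5.0, seed) for seed in range(30)]

THRESHOLD = 4.0  # hypothetical cut-off; it would need tuning in practice

correct_surprise = surprise(play(5.0, seed=1000), reference_runs)
mutant_surprise = surprise(play(9.0, seed=1000), reference_runs)  # faulty speed

print(f"correct program: surprise = {correct_surprise:.2f}")
print(f"mutant program:  surprise = {mutant_surprise:.2f}")
print("mutant flagged:", mutant_surprise > THRESHOLD)
```

The design point is that the baseline is built from many randomized runs of the reference program, so ordinary variation between runs does not count as a failure, whereas a behavioral change shifts the activations well outside that range.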