Automating test oracles is one of the most challenging facets of software testing, yet it has received comparatively less attention than automated test input generation. A test oracle relies on a ground truth that can distinguish correct from buggy behavior in order to determine whether a test fails (detects a bug) or passes. What makes the oracle problem challenging and undecidable is the assumption that the ground truth must know the exact expected, correct, or buggy behavior. We argue, however, that one can still build an accurate oracle without knowing the exact correct or buggy behavior, knowing only how the two might differ. This paper presents SEER, a learning-based approach that, in the absence of test assertions or other types of oracle, can determine whether a unit test passes or fails on a given method under test (MUT). To build the ground truth, SEER jointly embeds unit tests and the implementations of MUTs into a unified vector space, such that the neural representation of a test is similar to that of the MUTs it passes on but dissimilar to that of the MUTs it fails on. The classifier built on top of this vector representation serves as the oracle, generating a "fail" label when the test inputs detect a bug in the MUT and a "pass" label otherwise. Our extensive experiments applying SEER to more than 5K unit tests from a diverse set of open-source Java projects show that the produced oracle is (1) effective in predicting the fail or pass labels, achieving an overall accuracy, precision, recall, and F1 measure of 93%, 86%, 94%, and 90%, respectively; (2) generalizable, predicting the labels for the unit tests of projects that were not in the training or validation set with a negligible performance drop; and (3) efficient, detecting the existence of bugs in only 6.5 milliseconds on average.
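To make the idea of a similarity-based oracle concrete, the following is a minimal sketch and not SEER's actual model. It assumes that a test and a MUT have already been mapped to vectors in a shared space, and it substitutes a plain cosine-similarity threshold for the learned classifier. The function names, the toy embedding values, and the threshold of 0.5 are all hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def predict_label(test_vec: Sequence[float],
                  mut_vec: Sequence[float],
                  threshold: float = 0.5) -> str:
    """Hypothetical stand-in for the learned classifier.

    A test whose embedding is close to the MUT embedding is labeled "pass";
    otherwise it is labeled "fail". The threshold is an assumption.
    """
    similarity = cosine_similarity(test_vec, mut_vec)
    return "pass" if similarity >= threshold else "fail"


# Toy embeddings (made up for illustration, not produced by any real encoder).
test_embedding = [0.9, 0.1, 0.3]
correct_mut_embedding = [0.8, 0.2, 0.4]    # lies close to the test embedding
buggy_mut_embedding = [-0.7, 0.9, -0.2]    # points away from the test embedding

print(predict_label(test_embedding, correct_mut_embedding))
print(predict_label(test_embedding, buggy_mut_embedding))
```

In this sketch the decision rule is fixed by hand. In the approach described above, both the embedding and the classifier are learned from data, so the separation between "pass" and "fail" pairs would come from training rather than from a manually chosen threshold.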