Test suite effectiveness metric evaluation: what do we know and what should we do?

Comparing test suite effectiveness metrics has long been an active research topic. However, prior studies reach different, and sometimes contradictory, conclusions when comparing such metrics. The problem we find most troubling for our community is that researchers tend to oversimplify the description of the ground truth they use. For example, a common statement is that "we studied the correlation between real faults and the metric to evaluate (MTE)". However, the meaning of "real faults" is not clear-cut, so it needs to be scrutinized. Without such scrutiny, the resulting conclusions can only be partially understood. To tackle this challenge, we propose a framework, ASSENT (evAluating teSt Suite EffectiveNess meTrics), to guide follow-up research. In essence, ASSENT consists of three fundamental components: the ground truth, the benchmark test suites, and the agreement indicator. It proceeds in four steps. First, materialize the ground truth that determines the real order in effectiveness among test suites. Second, generate a set of benchmark test suites and derive their ground truth order in effectiveness. Third, for the same benchmark test suites, derive the MTE order in effectiveness using the MTE. Finally, calculate the agreement indicator between the two orders. Under ASSENT, we are able to compare the accuracy of different test suite effectiveness metrics. We apply ASSENT to evaluate representative test suite effectiveness metrics, including mutation score metrics and code coverage metrics. Our results show that, when real faults serve as the ground truth, mutation score and subsuming mutation score are the best metrics for quantifying test suite effectiveness. Meanwhile, when mutants are used in place of real faults, MTE values are overestimated by more than 20%.
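
The four steps can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not name the agreement indicator, so Kendall's tau-a is assumed here as one plausible rank-agreement measure; the ground truth is assumed to be the number of real faults a suite detects; and the field names, the helper functions, and all numbers are hypothetical toy values, not results from the paper.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Assumed here as the agreement indicator; the abstract does not say
    which indicator ASSENT actually uses.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with >= 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 means a tie in either ordering; tau-a counts it as neither
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def agreement(suites, ground_truth, mte):
    """Steps 2-4: score every benchmark suite both ways, then compare orders."""
    truth_scores = [ground_truth(s) for s in suites]   # ground truth order
    mte_scores = [mte(s) for s in suites]              # MTE order
    return kendall_tau_a(truth_scores, mte_scores)


# Hypothetical benchmark test suites with made-up toy values.
suites = [
    {"name": "A", "faults_detected": 1, "mutation_score": 0.30, "statement_coverage": 0.80},
    {"name": "B", "faults_detected": 3, "mutation_score": 0.55, "statement_coverage": 0.60},
    {"name": "C", "faults_detected": 5, "mutation_score": 0.70, "statement_coverage": 0.90},
    {"name": "D", "faults_detected": 8, "mutation_score": 0.90, "statement_coverage": 0.70},
]


# Step 1: materialize the ground truth (assumed: real faults detected).
def real_faults(suite):
    return suite["faults_detected"]


tau_mutation = agreement(suites, real_faults, lambda s: s["mutation_score"])
tau_coverage = agreement(suites, real_faults, lambda s: s["statement_coverage"])

print("mutation score agreement:", tau_mutation)
print("statement coverage agreement:", tau_coverage)
```

With these toy values, the mutation score orders the four suites exactly as the ground truth does, while the statement coverage order agrees on only half of the pairs. Swapping `real_faults` for a mutant-based ground truth function is how one would compare the two choices of ground truth within the same sketch.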
