Background: Test suites are frequently used to quantify relevant software attributes, such as quality or productivity. Problem: We have detected that the same response variable, measured using different test suites, yields different experimental results. Aims: Assess to what extent differences in test case construction influence measurement accuracy and experimental outcomes. Method: Two industry experiments were measured using two different test suites, one generated using an ad hoc method and the other using equivalence partitioning. The accuracy of the measures was studied using standard procedures, such as ISO 5725, Bland-Altman analysis, and Intraclass Correlation Coefficients. Results: The values of the response variables differ by up to ±60%, depending on the test suite used (ad hoc vs. equivalence partitioning). Conclusions: Disclosing datasets and analysis code is insufficient to ensure the reproducibility of SE experiments. Experimenters should disclose all experimental materials needed to perform independent measurement and re-analysis.
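As an illustration of the kind of agreement analysis named in the Method, the following minimal sketch computes the Bland-Altman bias and 95% limits of agreement for one response variable measured with two test suites. The function name `bland_altman` and the score values are hypothetical and are not taken from the experiments; the sketch assumes paired measurements (one score per subject and suite) and approximately normal differences, and it is not the authors' analysis code.

```python
from statistics import mean, stdev


def bland_altman(a, b):
    """Return (bias, lower_limit, upper_limit) for paired measurements a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two paired samples with at least two observations")
    # Per-subject difference between the two measurement methods.
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)   # systematic difference between the methods
    sd = stdev(diffs)    # sample standard deviation of the differences
    # 95% limits of agreement under the normality assumption.
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical scores (percentage of passing test cases) for five subjects.
ad_hoc = [70.0, 55.0, 80.0, 40.0, 65.0]
equivalence_partitioning = [60.0, 50.0, 85.0, 30.0, 55.0]

bias, lower, upper = bland_altman(ad_hoc, equivalence_partitioning)
print(f"bias = {bias:.2f}, limits of agreement = [{lower:.2f}, {upper:.2f}]")
```

A bias far from zero, or wide limits of agreement, would suggest that the two test suites cannot be used interchangeably to measure the same response variable.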