Context: Differential testing is a useful software testing approach that compares the results of different implementations of the same algorithm. In recent years, this approach has been used successfully in test campaigns for deep learning frameworks. Objective: Little is known about the application of differential testing beyond deep learning. In this article, we close this gap for classification algorithms. Method: We conduct a case study using Scikit-learn, Weka, Spark MLlib, and Caret in which we assess the potential of differential testing by identifying which algorithms are available in multiple frameworks, its feasibility by identifying pairs of algorithm configurations that should exhibit the same behavior, and its effectiveness by executing tests for the identified pairs and analyzing the deviations. Results: While we found a large potential for popular algorithms, the feasibility seems limited because it is often not possible to determine configurations that are the same across frameworks. The execution of the feasible tests revealed a large number of deviations in both the scores and the predicted classes. Only a lenient oracle based on the statistical significance of class differences avoids a huge number of test failures. Conclusions: The potential of differential testing beyond deep learning appears limited for research into the quality of machine learning libraries. Practitioners may still use the approach if they have deep knowledge of the implementations, especially if a coarse oracle that only considers significant differences between the predicted classes is sufficient.
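The oracles described above (exact scores, exact classes, and a lenient statistical check on class differences) can be sketched for a single algorithm pair. This is a minimal illustration, not the study's actual harness: as a stand-in for a cross-framework pair (e.g. Scikit-learn vs. Weka), it compares two scikit-learn logistic regression solvers that should exhibit the same behavior, and it uses a chi-squared contingency test as one possible form of the lenient class-significance oracle.

```python
# Hedged sketch of a differential test for one algorithm pair.
# Assumption: two solvers of the same algorithm stand in for two frameworks.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# Two implementations configured to exhibit the same behavior.
clf_a = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)
clf_b = LogisticRegression(solver="liblinear", max_iter=1000).fit(X, y)

scores_a = clf_a.predict_proba(X)[:, 1]
scores_b = clf_b.predict_proba(X)[:, 1]
classes_a = clf_a.predict(X)
classes_b = clf_b.predict(X)

# Strict oracles: scores equal within a tolerance, classes exactly equal.
score_deviations = int(np.sum(~np.isclose(scores_a, scores_b, atol=1e-4)))
class_deviations = int(np.sum(classes_a != classes_b))

# Lenient oracle (one possible interpretation): do the predicted class
# distributions of the two implementations differ significantly?
table = np.array([
    [np.sum(classes_a == 0), np.sum(classes_a == 1)],
    [np.sum(classes_b == 0), np.sum(classes_b == 1)],
])
_, p_value, _, _ = chi2_contingency(table)

print(f"score deviations: {score_deviations}")
print(f"class deviations: {class_deviations}")
print(f"lenient oracle passes: {p_value > 0.05}")
```

In line with the abstract's results, the strict score and class oracles typically report deviations even for such closely matched configurations, while the lenient oracle only fails when the disagreement shifts the overall class distribution significantly.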