A growing body of research indicates the potential of machine learning to tackle complex software testing challenges. One such challenge is continuous integration testing, which is highly time-constrained and generates a large amount of data from iterative code commits and test runs. In such a setting, the plentiful test data can be used to train machine learning predictors that identify test cases able to speed up the detection of regression bugs introduced during code integration. However, different machine learning models can have different fault prediction performance depending on the context and the parameters of continuous integration testing, for example the variable time budget available for continuous integration cycles, or the size of the test execution history used for learning to prioritize failing test cases. Existing studies on test case prioritization rarely examine both of these factors, which are essential to continuous integration practice. In this study, we perform a comprehensive comparison of the fault prediction performance of the machine learning approaches that have shown the best performance on test case prioritization tasks in the literature. We evaluate the accuracy of the classifiers in predicting fault-detecting tests for different values of the continuous integration time budget and for different lengths of the test history used for training. In the evaluation, we use real-world industrial datasets from a continuous integration practice. The results show that the performance of machine learning models varies with both the size of the test history used for training and the time budget available for test case execution. Our results imply that machine learning approaches for test prioritization in continuous integration testing should be carefully configured to achieve optimal performance.
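To make the evaluation setup concrete, the sketch below illustrates the kind of experiment the abstract describes: training a classifier on test execution history of varying length and scoring how well the resulting ranking detects faults under varying time budgets. It is a minimal sketch, not the paper's actual method: the random-forest model, the feature set (failure rate, last verdict, duration), the synthetic data, and the specific history lengths and budget fractions are all illustrative assumptions.

```python
# Minimal sketch of the evaluation loop described above, assuming a
# scikit-learn-style workflow and synthetic data. Features, history
# lengths, and time budgets are hypothetical, not the paper's setup.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic test-execution history: one row per (CI cycle, test case),
# with illustrative features [recent_failure_rate, last_verdict, duration];
# label 1 marks a fault-detecting test.
n_records = 5000
X = rng.random((n_records, 3))
y = (X[:, 0] + 0.2 * rng.random(n_records) > 0.8).astype(int)

history_lengths = [500, 1000, 2000, 4000]  # records used for training
time_budgets = [0.10, 0.25, 0.50]          # fraction of tests run per cycle

X_test, y_test = X[4000:], y[4000:]        # held-out "future" cycles

for h in history_lengths:
    # Train on the h most recent history records.
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[:h], y[:h])

    # Rank tests by predicted failure probability, then check how many
    # fault-detecting tests fit inside each time budget.
    scores = clf.predict_proba(X_test)[:, 1]
    order = np.argsort(-scores)
    for b in time_budgets:
        k = int(b * len(order))            # tests executable under budget b
        selected = np.zeros(len(order), dtype=int)
        selected[order[:k]] = 1
        print(f"history={h:4d} budget={b:.2f} "
              f"F1={f1_score(y_test, selected):.3f}")
```

Varying both axes in one loop, as above, is what lets the comparison expose interactions between history length and time budget rather than tuning each factor in isolation.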