Practical Flaky Test Prediction using Common Code Evolution and Test History Data

From arXiv, 12 pages; to be published in the Proceedings of the IEEE International Conference on Software Testing, Verification and Validation (ICST 2023).

Non-deterministically behaving test cases cause developers to lose trust in their regression test suites and to eventually ignore failures. Detecting flaky tests is therefore a crucial task in maintaining code quality, as it builds the necessary foundation for any form of systematic response to flakiness, such as test quarantining or automated debugging. Previous research has proposed various methods to detect flakiness, but when trying to deploy these in an industrial context, their reliance on instrumentation, test reruns, or language-specific artifacts was inhibitive. In this paper, we therefore investigate the prediction of flaky tests without such requirements on the underlying programming language, CI, build or test execution framework. Instead, we rely only on the most commonly available artifacts, namely the tests' outcomes and durations, as well as basic information about the code evolution to build predictive models capable of detecting flakiness. Furthermore, our approach does not require additional reruns, since it gathers this data from existing test executions. We trained several established classifiers on the suggested features and evaluated their performance on a large-scale industrial software system, from which we collected a data set of 100 flaky and 100 non-flaky test- and code-histories. The best model was able to achieve an F1-score of 95.5% using only 3 features: the tests' flip rates, the number of changes to source files in the last 54 days, as well as the number of changed files in the most recent pull request.
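The abstract names the tests' flip rate as one of the three features used by the best model, but does not spell out how it is computed. The sketch below assumes one common reading: the fraction of consecutive executions in which a test's outcome changed (pass to fail or fail to pass). The function name `flip_rate`, the boolean encoding of outcomes, and the choice to return 0.0 for histories shorter than two runs are illustrative assumptions, not the paper's definition.

```python
from typing import Sequence


def flip_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of consecutive run pairs whose outcome differs.

    `outcomes` is a chronologically ordered history for one test,
    with True meaning "passed" and False meaning "failed"
    (an assumed encoding for this sketch).
    """
    if len(outcomes) < 2:
        # No pair of runs to compare, so no flip can be observed.
        return 0.0
    flips = sum(
        1 for prev, curr in zip(outcomes, outcomes[1:]) if prev != curr
    )
    return flips / (len(outcomes) - 1)


if __name__ == "__main__":
    stable_history = [True] * 6
    flaky_history = [True, False, True, True, False, True]
    print("stable:", flip_rate(stable_history))
    print("flaky:", flip_rate(flaky_history))
```

Because the value is derived purely from recorded outcomes, it can be computed from existing CI logs without rerunning any test, which matches the paper's stated goal of avoiding additional reruns.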
