Non-deterministically behaving test cases cause developers to lose trust in their regression test suites and to eventually ignore failures. Detecting flaky tests is therefore a crucial task in maintaining code quality, as it builds the necessary foundation for any form of systematic response to flakiness, such as test quarantining or automated debugging. Previous research has proposed various methods to detect flakiness, but when trying to deploy these in an industrial context, their reliance on instrumentation, test reruns, or language-specific artifacts proved prohibitive. In this paper, we therefore investigate the prediction of flaky tests without such requirements on the underlying programming language, CI, build, or test execution framework. Instead, we rely only on the most commonly available artifacts, namely the tests' outcomes and durations, as well as basic information about the code evolution, to build predictive models capable of detecting flakiness. Furthermore, our approach does not require additional reruns, since it gathers this data from existing test executions. We trained several established classifiers on the suggested features and evaluated their performance on a large-scale industrial software system, from which we collected a data set of 100 flaky and 100 non-flaky test and code histories. The best model was able to achieve an F1-score of 95.5% using only 3 features: the tests' flip rates, the number of changes to source files in the last 54 days, and the number of changed files in the most recent pull request.
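As a minimal sketch of the first of these features, the flip rate can be computed from a test's existing outcome history alone, with no reruns or instrumentation. The definition below, the fraction of consecutive runs whose verdict flips between pass and fail, is an illustrative assumption; the paper's exact formula may differ.

```python
def flip_rate(outcomes):
    """Fraction of consecutive test runs whose verdict flips (pass <-> fail).

    `outcomes` is a chronological list of booleans (True = pass).
    Illustrative definition only; the exact formula used in the study
    may normalize differently.
    """
    if len(outcomes) < 2:
        return 0.0
    flips = sum(a != b for a, b in zip(outcomes, outcomes[1:]))
    return flips / (len(outcomes) - 1)

# A stable test never flips; an intermittently failing one flips often.
print(flip_rate([True, True, True, True]))         # 0.0
print(flip_rate([True, False, True, True, False]))  # 0.75
```

A consistently failing test also scores 0.0 under this definition, which is the desired behavior: deterministic failures are not flaky, only alternating verdicts are.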