Flaky tests are tests that both pass and fail across different executions of the same version of the program under test. They waste valuable developer time by forcing developers to investigate false alerts (flaky test failures). To deal with this problem, many prediction methods that identify flaky tests have been proposed. While promising, the actual utility of these methods remains unclear, since they have not been evaluated within a continuous integration (CI) process. In particular, it remains unclear what the impact of missed faults is, i.e., of treating fault-triggering test failures as flaky, at different CI cycles. To fill this gap, we apply state-of-the-art flakiness prediction methods to the Chromium CI and check their performance. Perhaps surprisingly, we find that, despite the high precision (99.2%) of the methods, their application causes many faults to be missed, approximately 76.2% of all regression faults. To explain this result, we analyse the fault-triggering failures and show that flaky tests have a strong fault-revealing capability: they reveal more than 1/3 of all regression faults. This indicates an inherent limitation of all methods that focus on identifying flaky tests instead of flaky test failures. Going a step further, we build failure-focused prediction methods and optimize them by considering new features. Interestingly, we find that these methods perform better than the test-focused ones, with the Matthews correlation coefficient (MCC) increasing from 0.20 to 0.42. Overall, our findings imply that future research should focus on predicting flaky test failures instead of flaky tests, and that more thorough experimental methodologies are needed when evaluating flakiness prediction methods.
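
As an illustration of how high precision can coexist with many missed faults and a low MCC, the following minimal sketch computes the three quantities from a confusion matrix. It assumes "flaky failure" is the positive class, and it simplifies the missed-fault rate to the share of fault-triggering failures predicted as flaky. The function names and the counts are hypothetical and are not taken from the Chromium study.

```python
import math


def precision(tp: int, fp: int) -> float:
    """Share of failures predicted flaky that really are flaky."""
    return tp / (tp + fp)


def missed_fault_rate(fp: int, tn: int) -> float:
    """Share of fault-triggering failures wrongly predicted as flaky.

    With 'flaky' as the positive class, fault-triggering failures are the
    negatives: fp of them are mislabelled flaky, tn are correctly kept.
    """
    return fp / (fp + tn)


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient of a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally defined as 0 when any marginal sum is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical, heavily imbalanced counts (assumption, not study data):
# 10,000 flaky failures versus 100 fault-triggering failures.
tp, fn = 9900, 100   # flaky failures: predicted flaky / predicted non-flaky
fp, tn = 80, 20      # fault-triggering failures: predicted flaky / non-flaky

print(f"precision         = {precision(tp, fp):.3f}")
print(f"missed-fault rate = {missed_fault_rate(fp, tn):.3f}")
print(f"MCC               = {mcc(tp, fp, tn, fn):.3f}")
```

Because flaky failures vastly outnumber fault-triggering ones in these made-up counts, precision stays above 99% even though most fault-triggering failures are misclassified. MCC accounts for all four cells of the confusion matrix, which is why it stays low here and is a more informative metric under such imbalance.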