Software testing helps ensure that code changes do not adversely affect existing functionality. However, a test case can be flaky, i.e., it can pass and fail across executions even for the same version of the source code. Flaky test cases introduce overhead to software development, as they can lead to unnecessary attempts to debug production or testing code. State-of-the-art ML-based flaky test case predictors rely on pre-defined sets of features that are either project-specific or require access to production code, which is not always available to software test engineers. Therefore, in this paper, we propose Flakify, a black-box, language model-based predictor for flaky test cases. Flakify relies exclusively on the source code of test cases and thus requires neither (a) access to production code (black-box), (b) rerunning test cases, nor (c) pre-defining features. To this end, we employed CodeBERT, a pre-trained language model, and fine-tuned it to predict flaky test cases using the source code of test cases. We evaluated Flakify on two publicly available datasets of flaky test cases (FlakeFlagger and IDoFT) and compared our technique with the FlakeFlagger approach using two different evaluation procedures: cross-validation and per-project validation. Flakify achieved high F1-scores on both datasets under both procedures. On the FlakeFlagger dataset, it surpassed FlakeFlagger by 10 and 18 percentage points in terms of precision and recall, respectively, thus reducing by the same percentages the cost that would otherwise be wasted on unnecessarily debugging test cases and production code. Flakify also achieved significantly better prediction results when used to predict flaky test cases in new projects, suggesting better generalizability than FlakeFlagger. Our results further show that a black-box version of FlakeFlagger is not a viable option for predicting flaky test cases.
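As an illustration of the evaluation terms used above, the following minimal Python sketch shows one plausible reading of per-project validation (hold out all test cases of one project, train on the rest) together with the precision, recall, and F1 computation. The record layout (`project`, `flaky`) and the helper names `per_project_splits` and `precision_recall_f1` are assumptions made for this example and are not taken from the Flakify implementation.

```python
from collections import defaultdict


def per_project_splits(samples):
    """Yield (held_out_project, train, test), holding out one project per round.

    Assumption: each sample is a dict with a "project" key (hypothetical layout).
    """
    by_project = defaultdict(list)
    for sample in samples:
        by_project[sample["project"]].append(sample)
    for project in sorted(by_project):
        test = by_project[project]
        train = [s for p, items in by_project.items() if p != project for s in items]
        yield project, train, test


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive (flaky) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data, invented purely for illustration.
samples = [
    {"project": "A", "flaky": True},
    {"project": "A", "flaky": False},
    {"project": "B", "flaky": True},
    {"project": "C", "flaky": False},
]

for project, train, test in per_project_splits(samples):
    # A real predictor would be trained on `train` and applied to `test` here.
    print(project, len(train), len(test))
```

Under this reading, the held-out project never contributes training data, which is why per-project results speak to generalizability to new projects in a way that cross-validation over pooled test cases does not.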