Flaky tests are tests that yield different outcomes when run on the same version of a program. This non-deterministic behaviour plagues continuous integration with false signals, wasting developers' time and reducing their trust in test suites. Studies have highlighted the importance of keeping tests free of flakiness. Recently, the research community has advanced the detection of flaky tests by proposing many static and dynamic approaches. While promising, those approaches mainly focus on classifying tests as flaky or not and, even when high performance is reported, it remains challenging to understand the cause of flakiness. Understanding the cause is crucial for researchers and developers who aim to fix it. To help with the comprehension of a given flaky test, we propose FlakyCat, the first approach for classifying flaky tests based on their root cause category. FlakyCat relies on CodeBERT for code representation and leverages a Siamese network-based few-shot learning method to train a multi-class classifier with little data. We train and evaluate FlakyCat on a set of 343 flaky tests collected from open-source Java projects. Our evaluation shows that FlakyCat categorises flaky tests accurately, with a weighted F1 score of 70%. Furthermore, we investigate the performance of our approach for each category, revealing that Async waits, Unordered collections and Time-related flaky tests are accurately classified, while Concurrency-related flaky tests are more challenging to predict. Finally, to facilitate the comprehension of FlakyCat's predictions, we present a new technique for CodeBERT-based model interpretability that highlights the code statements influencing the categorisation.
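For readers unfamiliar with the reported metric, the sketch below shows how a weighted F1 score is commonly computed: the per-class F1 scores are averaged, each weighted by the share of true examples in that class. This is an illustration under stated assumptions, not FlakyCat's evaluation code. The function name `weighted_f1` and the toy labels are hypothetical, and the numbers are invented for the example rather than taken from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: predicted `label` and the true class is `label`.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to its share of the data.
        score += (n_true / total) * f1
    return score


# Toy data, invented for illustration only (not results from the paper).
y_true = ["Async wait", "Async wait", "Async wait", "Time", "Time", "Concurrency"]
y_pred = ["Async wait", "Async wait", "Time", "Time", "Time", "Async wait"]
print(f"{weighted_f1(y_true, y_pred):.2f}")  # prints 0.60
```

In this toy case the single Concurrency test is never predicted correctly, so that class contributes an F1 of 0. Because it holds only one of the six examples, it lowers the weighted score less than it would lower an unweighted (macro) average.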