Software testing is still a manual process in many industries, despite recent improvements in automated testing techniques. As a result, test cases are often specified in natural language by different employees, and the test suite may contain many redundant test cases. This increases the (already high) cost of test execution. Manually identifying similar test cases is time-consuming and error-prone. In this paper, we therefore propose an unsupervised approach to identify similar test cases, which combines text embedding, text similarity, and clustering techniques. We evaluate five text embedding techniques, two text similarity metrics, and two clustering techniques for clustering similar test steps, and four techniques for identifying similar test cases from the resulting test step clusters. Through an evaluation in an industrial setting, we show that our approach achieves high performance in clustering test steps (an F-score of 87.39%) and in identifying similar test cases (an F-score of 83.47%). Furthermore, a validation with developers indicates several practical uses of our approach (such as identifying redundant and legacy test cases), which help to reduce manual testing effort and time.
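To make the pipeline concrete, the following is a minimal sketch of the embed, compare, and cluster idea, not the implementation evaluated in the paper. The abstract does not name the specific techniques, so everything here is an assumption chosen for illustration: a bag-of-words count vector stands in for the text embedding, cosine similarity is the similarity metric, a greedy threshold-based pass stands in for the clustering technique, and Jaccard overlap of step-cluster sets stands in for the test-case matching step. The function names, thresholds, and sample test cases are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def embed(text):
    """Bag-of-words term counts (assumed stand-in for a real text embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_steps(steps, threshold=0.6):
    """Greedy one-pass clustering (assumed stand-in for the clustering step).

    Each step joins the first cluster whose representative (its first member)
    is at least `threshold` similar; otherwise it starts a new cluster.
    Returns one cluster id per input step.
    """
    representatives, labels = [], []
    for step in steps:
        vec = embed(step)
        for cluster_id, rep in enumerate(representatives):
            if cosine(vec, rep) >= threshold:
                labels.append(cluster_id)
                break
        else:
            representatives.append(vec)
            labels.append(len(representatives) - 1)
    return labels


def case_similarity(clusters_a, clusters_b):
    """Jaccard overlap between the step-cluster sets of two test cases."""
    union = clusters_a | clusters_b
    return len(clusters_a & clusters_b) / len(union) if union else 0.0


def similar_test_cases(test_cases, step_threshold=0.6, case_threshold=0.8):
    """Return pairs of test case names whose step clusters largely overlap."""
    names = list(test_cases)
    all_steps = [step for name in names for step in test_cases[name]]
    labels = cluster_steps(all_steps, step_threshold)

    # Map each test case back to the set of cluster ids of its steps.
    clusters, pos = {}, 0
    for name in names:
        count = len(test_cases[name])
        clusters[name] = set(labels[pos:pos + count])
        pos += count

    return [
        (a, b)
        for a, b in combinations(names, 2)
        if case_similarity(clusters[a], clusters[b]) >= case_threshold
    ]


# Hypothetical test cases written in natural language by different authors.
test_cases = {
    "TC1": ["open the login page", "enter valid user name and password",
            "click the login button"],
    "TC2": ["open login page", "enter a valid user name and password",
            "click on the login button"],
    "TC3": ["go to the settings menu", "change the display language"],
}

pairs = similar_test_cases(test_cases)
print(pairs)
```

In this toy input, TC1 and TC2 differ only in wording, so their steps fall into the same clusters and the pair is reported as similar, while TC3 is not. Count vectors only capture shared words; a learned embedding would be needed to match steps that are phrased with different vocabulary.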