Researchers and practitioners have designed and implemented various automated test case generators to support effective software testing. Such generators exist for various languages (e.g., Java, C#, or Python) and platforms (e.g., desktop, web, or mobile applications), and they exhibit varying effectiveness and efficiency depending on the testing goals they aim to satisfy (e.g., unit testing of libraries vs. system testing of entire applications) and the underlying techniques they implement. In this context, practitioners need to be able to compare different generators to identify the one best suited to their requirements, while researchers seek to identify future research directions. This can be achieved through the systematic execution of large-scale evaluations of different generators. However, executing such empirical evaluations is not trivial and requires substantial effort to collect benchmarks, set up the evaluation infrastructure, and collect and analyse the results. In this paper, we present our JUnit Generation benchmarking infrastructure (JUGE), which supports generators (e.g., search-based, random-based, or symbolic-execution-based) that automate the production of unit tests for various purposes (e.g., validation, regression testing, or fault localization). The primary goal is to reduce the overall effort, ease the comparison of several generators, and enhance knowledge transfer between academia and industry by standardizing the evaluation and comparison process. Since 2013, eight editions of a unit testing tool competition, co-located with the Search-Based Software Testing Workshop, have taken place, using and updating JUGE. As a result, an increasing number of tools (more than ten) from both academia and industry have been evaluated with JUGE, have matured over the years, and have allowed the identification of future research directions.