Automated tools for solving GitHub issues are receiving significant attention from both researchers and practitioners, e.g., in the form of foundation models and LLM-based agents prompted with issues. A crucial step toward successfully solving an issue is creating a test case that accurately reproduces it. Such a test case can guide the search for an appropriate patch and help validate whether the patch matches the issue's intent. However, existing techniques for issue reproduction show only moderate success. This paper presents Issue2Test, an LLM-based technique for automatically generating a reproducing test case for a given issue report. Unlike automated regression test generators, which aim to create passing tests, our approach aims to create a test that fails, and that fails specifically for the reason described in the issue. To this end, Issue2Test performs three steps: (1) understand the issue and gather context (e.g., related files and project-specific guidelines) relevant for reproducing it; (2) generate a candidate test case; and (3) iteratively refine the test case based on compilation and runtime feedback until it fails and the failure aligns with the problem described in the issue. We evaluate Issue2Test on the SWT-bench-lite dataset, where it successfully reproduces 32.9% of the issues, a 16.3% relative improvement over the best existing technique. Our evaluation also shows that Issue2Test reproduces 20 issues that four prior techniques fail to address, contributing a total of 60.4% of all issues reproduced by these tools. We envision that our approach will contribute to the overall progress on the important task of automatically solving GitHub issues.
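The three steps can be illustrated with a minimal Python sketch of the control flow. This is not the Issue2Test implementation: the `RunResult` type, the function `reproduce_issue`, all callable parameters (`gather_context`, `generate_test`, `run_test`, `failure_matches_issue`, `refine_test`), and the `max_iterations` cap are hypothetical names and assumptions introduced only for illustration. In practice, the generation, refinement, and failure-matching callables would presumably be backed by an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    """Outcome of executing one candidate test (hypothetical structure)."""
    failed: bool  # True if the test failed or could not be compiled/run
    output: str   # compiler or runtime output, used as feedback


def reproduce_issue(
    issue: str,
    gather_context: Callable[[str], str],
    generate_test: Callable[[str, str], str],
    run_test: Callable[[str], RunResult],
    failure_matches_issue: Callable[[str, RunResult], bool],
    refine_test: Callable[[str, RunResult], str],
    max_iterations: int = 5,  # assumed budget; not taken from the paper
) -> Optional[str]:
    """Return a test that fails for the reason in the issue, or None."""
    # Step 1: understand the issue and gather relevant context.
    context = gather_context(issue)
    # Step 2: generate a candidate test case.
    candidate = generate_test(issue, context)
    # Step 3: refine until the test fails AND the failure matches the issue.
    for _ in range(max_iterations):
        result = run_test(candidate)
        if result.failed and failure_matches_issue(issue, result):
            return candidate
        # A passing test, or one failing for an unrelated reason
        # (e.g., an import error), is revised using the feedback.
        candidate = refine_test(candidate, result)
    return None
```

Note the acceptance condition: unlike a regression test generator, the loop rejects a passing test and also rejects a test whose failure is unrelated to the issue.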