Many automated test generation techniques have been developed to aid developers in writing tests. To facilitate full automation, most existing techniques aim to either increase coverage or generate exploratory inputs. However, existing test generation techniques largely fall short of achieving more semantic objectives, such as generating tests to reproduce a given bug report. Reproducing bugs is nonetheless important, as our empirical study shows that the number of tests added in open source repositories due to issues was about 28% of the corresponding project test suite size. Meanwhile, due to the difficulty of transforming the expected program semantics in bug reports into test oracles, existing failure reproduction techniques tend to deal exclusively with program crashes, a small subset of all bug reports. To automate test generation from general bug reports, we propose LIBRO, a framework that uses Large Language Models (LLMs), which have been shown to be capable of performing code-related tasks. Since LLMs themselves cannot execute the target buggy code, we focus on post-processing steps that help us discern when LLMs are effective, and rank the produced tests according to their validity. Our evaluation of LIBRO shows that, on the widely studied Defects4J benchmark, LIBRO can generate failure-reproducing test cases for 33% of all studied cases (251 out of 750), while ranking a bug-reproducing test in first place for 149 bugs. To mitigate data contamination, we also evaluate LIBRO against 31 bug reports submitted after the collection of the LLM training data terminated: LIBRO produces bug-reproducing tests for 32% of the studied bug reports. Overall, our results show that LIBRO has the potential to significantly enhance developer efficiency by automatically generating tests from bug reports.