To support software developers in finding and fixing software bugs, several automated program repair techniques have been introduced. Given a test suite, standard methods typically either synthesize a repair or navigate a search space of software edits to find test-suite-passing variants. Recent program repair methods are based on deep learning approaches. One of these novel methods, which is not primarily intended for automated program repair but is still suitable for it, is ChatGPT. The bug fixing performance of ChatGPT, however, is so far unclear. Therefore, in this paper we evaluate ChatGPT on the standard bug fixing benchmark set, QuixBugs, and compare its performance with the results of several other approaches reported in the literature. We find that ChatGPT's bug fixing performance is competitive with that of the common deep learning approaches CoCoNut and Codex and notably better than the results reported for standard program repair approaches. In contrast to previous approaches, ChatGPT offers a dialogue system through which further information, e.g., the expected output for a certain input or an observed error message, can be entered. By providing such hints to ChatGPT, its success rate can be further increased, fixing 31 out of 40 bugs and outperforming the state of the art.