Automated debugging techniques have the potential to reduce developer effort in debugging, and have matured enough to be adopted by industry. However, one critical issue is that, while developers want rationales for automated debugging results, existing techniques are ill-suited to provide them, as their deduction process differs significantly from that of human developers. Inspired by the way developers interact with code when debugging, we propose Automated Scientific Debugging (AutoSD), a technique that, given buggy code and a bug-revealing test, prompts large language models to automatically generate hypotheses, uses debuggers to actively interact with the buggy code, and thus automatically reaches conclusions prior to patch generation. By aligning the reasoning of automated debugging more closely with that of human developers, we aim to produce intelligible explanations of how a specific patch was generated, with the hope that such explanations will lead to more efficient and more accurate developer decisions. Our empirical analysis on three program repair benchmarks shows that AutoSD performs competitively with other program repair baselines, and that it can indicate when it is confident in its results. Furthermore, we perform a human study with 20 participants, including six professional developers, to evaluate the utility of the explanations AutoSD produces. Participants with access to explanations could judge patch correctness in roughly the same time as those without, but their accuracy improved for five out of six real-world bugs studied. Additionally, 70% of participants answered that they would want explanations when using repair tools, while 55% answered that they were satisfied with the Scientific Debugging presentation.
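To make the described pipeline concrete, the sketch below shows one way the hypothesize-experiment-observe-conclude loop could be wired to a debugger. It is a minimal illustration under stated assumptions, not the paper's implementation: the `llm` callable, the prompt wording, the `<DONE>` sentinel, and the single-shot pdb harness are all hypothetical placeholders.

```python
# Minimal sketch of the Scientific Debugging loop described above.
# Illustrative only: `llm` stands in for any chat-completion function,
# and the one-command-per-run pdb harness simplifies a real
# interactive debugger session.
import subprocess
from typing import Callable


def run_pdb_command(script: str, command: str) -> str:
    """Run one pdb command against the buggy script and return its output.
    (Simplified: a real harness would keep the session alive across steps.)"""
    proc = subprocess.run(
        ["python", "-m", "pdb", script],
        input=f"{command}\nquit\n",
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.stdout


def auto_scientific_debugging(
    llm: Callable[[str], str],       # hypothetical LLM interface
    buggy_script: str,
    failing_test_output: str,
    max_rounds: int = 5,
) -> tuple[str, str]:
    """Drive an LLM through Scientific Debugging; return (patch, explanation)."""
    transcript = (
        f"Buggy script: {buggy_script}\n"
        f"Failing test output:\n{failing_test_output}\n"
    )
    for _ in range(max_rounds):
        # 1. The model states a falsifiable hypothesis about the defect.
        hypothesis = llm(transcript + "Hypothesis:")
        # 2. It designs an experiment as a concrete debugger command.
        experiment = llm(
            transcript + f"Hypothesis: {hypothesis}\nExperiment (pdb command):"
        )
        # 3. The experiment is executed against the real buggy code.
        observation = run_pdb_command(buggy_script, experiment)
        # 4. The model concludes whether the hypothesis held up.
        conclusion = llm(
            transcript
            + f"Hypothesis: {hypothesis}\nExperiment: {experiment}\n"
            + f"Observation: {observation}\nConclusion:"
        )
        transcript += (
            f"Hypothesis: {hypothesis}\nExperiment: {experiment}\n"
            f"Observation: {observation}\nConclusion: {conclusion}\n"
        )
        # Stop once the model signals it has located the defect
        # ("<DONE>" is an assumed sentinel, not the paper's actual token).
        if "<DONE>" in conclusion:
            break
    patch = llm(transcript + "Patch:")
    return patch, transcript  # the transcript doubles as the explanation
```

Because the loop accumulates every hypothesis, experiment, observation, and conclusion in a transcript, the final patch arrives with a step-by-step record of how it was derived; that record is the kind of human-readable explanation the abstract refers to.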