Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in one attempt is challenging, and some prior work has therefore designed program repair approaches to improve code generation performance. In this work, we propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations. In particular, we demonstrate that Self-Debugging can teach the large language model to perform rubber duck debugging; i.e., without any feedback on code correctness or error messages, the model is able to identify its mistakes by explaining the generated code in natural language. Self-Debugging achieves state-of-the-art performance on several code generation benchmarks, including the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark, where there are no unit tests to verify the correctness of predictions, Self-Debugging with code explanation consistently improves the baseline by 2-3%, and improves prediction accuracy on problems of the hardest level by 9%. On TransCoder and MBPP, where unit tests are available, Self-Debugging improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback messages and reusing failed predictions, Self-Debugging notably improves sample efficiency, and can match or outperform baseline models that generate more than 10x candidate programs.
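To make the described loop concrete, below is a minimal sketch of Self-Debugging with unit-test feedback and a rubber-duck explanation step, assuming a hypothetical `llm(prompt) -> str` completion function; the prompt wording and helper names (`run_tests`, `self_debug`) are illustrative and not the paper's exact prompts.

```python
# Minimal Self-Debugging sketch (illustrative; not the paper's exact prompts).

def run_tests(code: str, tests: list[str]) -> tuple[bool, str]:
    """Execute the candidate program against unit tests; return (passed, feedback)."""
    namespace: dict = {}
    try:
        exec(code, namespace)           # define the candidate function(s)
        for test in tests:
            exec(test, namespace)       # each test is e.g. "assert f(2) == 4"
        return True, "All tests passed."
    except Exception as err:            # use the error message as execution feedback
        return False, f"{type(err).__name__}: {err}"

def self_debug(problem: str, tests: list[str], llm, max_turns: int = 3) -> str:
    """Generate code, then iteratively explain and revise it until tests pass."""
    code = llm(f"Write a Python function for:\n{problem}")
    for _ in range(max_turns):
        passed, feedback = run_tests(code, tests)
        if passed:
            break
        # Rubber-duck step: ask the model to explain its own code, then revise
        # using both the explanation and the execution feedback.
        explanation = llm(f"Explain this code line by line:\n{code}")
        code = llm(
            f"Task: {problem}\nCode:\n{code}\n"
            f"Explanation:\n{explanation}\nFeedback: {feedback}\n"
            "Fix the code so that all unit tests pass."
        )
    return code
```

When no unit tests are available (as on Spider), the same loop can be run with the explanation alone as feedback, relying on the model to spot inconsistencies between the explanation and the task description.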