Large language models (LLMs) have demonstrated an impressive ability to generate code for various programming tasks. In many instances, an LLM can produce a correct program for a task when given numerous trials. Consequently, a recent trend is to perform large-scale sampling of programs from a model and then filter/rank the candidates based on their execution against a small number of known unit tests in order to select one candidate solution. However, these approaches assume that the unit tests are given and that the generated programs can be safely executed, even though they may perform arbitrary dangerous operations such as file manipulations. Both assumptions are impractical in real-world software development. In this paper, we propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it. The fault-aware rankers are trained to predict different kinds of execution information, such as the exact compile/runtime error type (e.g., an IndexError or a TypeError). We show that our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models (including Codex, GPT-Neo, and GPT-J) on the APPS, HumanEval, and MBPP datasets.