Large language models (LLMs) have demonstrated an impressive ability to generate code for various programming tasks. In many instances, an LLM can generate a correct program for a task when given numerous trials. Consequently, a recent trend is to sample programs from a model at large scale and then filter/rank them based on their execution against a small number of known unit tests in order to select one candidate solution. However, these approaches assume both that unit tests are given and that the generated programs can be safely executed (even though such programs can perform arbitrarily dangerous operations such as file manipulations). Both assumptions are impractical in real-world software development. In this paper, we propose CodeRanker, a neural ranker that can predict the correctness of a sampled program without executing it. CodeRanker is fault-aware, i.e., it is trained to predict different kinds of execution information, such as the exact compile/runtime error type (e.g., an IndexError or a TypeError). We show that CodeRanker significantly increases the pass@1 accuracy of various code generation models (including Codex, GPT-Neo, and GPT-J) on the APPS, HumanEval, and MBPP datasets.
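The execution-free ranking idea described above can be sketched as follows. This is a minimal illustration, not the paper's actual model: the real CodeRanker is a trained neural classifier over (task, program) pairs, whereas `toy_classifier` below is a hypothetical stand-in, and all function names are assumptions for illustration only.

```python
# Hedged sketch of fault-aware ranking: score each sampled program by its
# predicted probability of passing, and never execute any candidate.
# All names here (rank_candidates, toy_classifier) are hypothetical.

# Error categories a fault-aware ranker might predict, following the
# abstract's examples (e.g., IndexError, TypeError).
ERROR_CLASSES = ["pass", "CompileError", "IndexError", "TypeError"]

def rank_candidates(task, candidates, classify):
    """Order sampled programs by predicted probability of 'pass'.

    `classify(task, program)` returns a dict mapping each label in
    ERROR_CLASSES to a probability; no program is ever run.
    """
    scored = [(classify(task, prog)["pass"], prog) for prog in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [prog for _, prog in scored]

def toy_classifier(task, prog):
    """Stand-in for a trained neural ranker (crude heuristic for demo)."""
    probs = {label: 0.0 for label in ERROR_CLASSES}
    probs["pass"] = 0.9 if "return" in prog else 0.1
    probs["CompileError"] = 1.0 - probs["pass"]
    return probs

candidates = [
    "def f(x): x + 1",          # buggy: missing return
    "def f(x): return x + 1",   # correct
]
best = rank_candidates("increment x", candidates, toy_classifier)[0]
print(best)
```

Selecting the top-ranked candidate in this way corresponds to the pass@1 setting: only one program is submitted, chosen purely from the ranker's predictions rather than from test execution.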