Given a programming problem, pre-trained language models such as Codex have demonstrated the ability to generate multiple different code solutions via sampling. However, selecting a correct or best solution from those samples remains a challenge. While an easy way to verify the correctness of a code solution is to execute test cases, producing high-quality test cases is prohibitively expensive. In this paper, we explore the use of pre-trained language models to automatically generate test cases, calling our method CodeT: Code generation with generated Tests. CodeT executes the code solutions using the generated test cases, and then chooses the best solution based on a dual execution agreement with both the generated test cases and the other generated solutions. We evaluate CodeT on five different pre-trained models using both the HumanEval and MBPP benchmarks. Extensive experimental results demonstrate that CodeT achieves significant, consistent, and surprising improvements over previous methods. For example, CodeT improves pass@1 on HumanEval to 65.8%, an absolute 18.8% increase with the code-davinci-002 model and an absolute improvement of over 20% against previous state-of-the-art results.
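The dual execution agreement described above can be sketched as follows. This is a simplified illustration under assumed names: each candidate solution is executed against the generated test cases, solutions that pass exactly the same set of tests are grouped together, and a group is scored by the number of agreeing solutions times the number of tests they pass. The function and data here are hypothetical, not the paper's reference implementation.

```python
from collections import defaultdict

def codet_rank(solutions, passes):
    """Pick a solution by dual execution agreement (simplified sketch).

    solutions: list of candidate code solutions.
    passes[i]: frozenset of indices of generated tests that solution i
               passes, obtained by actually executing the solution.
    A consensus group's score is:
        (number of solutions sharing the pass set) * (number of tests passed).
    """
    # Group solutions that agree on exactly the same set of passing tests.
    groups = defaultdict(list)
    for i, passed_tests in enumerate(passes):
        groups[passed_tests].append(i)
    # Select the group with the highest agreement score.
    best_pass_set, best_ids = max(
        groups.items(), key=lambda kv: len(kv[1]) * len(kv[0])
    )
    # Any solution in the winning group is returned as the best candidate.
    return solutions[best_ids[0]]
```

For example, if two of three sampled solutions both pass tests {0, 1} while the third passes only test {0}, the first group scores 2 × 2 = 4 versus 1 × 1 = 1, so a solution from the larger agreeing group is selected.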