The advent of pre-trained code language models (CodeLMs) has led to significant progress in language-to-code generation. State-of-the-art approaches in this area combine CodeLM decoding with sample pruning and reranking using test cases or heuristics based on the execution results. However, it is challenging to obtain test cases for many real-world language-to-code applications, and heuristics cannot adequately capture the semantic features of the execution results, such as data type and value range, which often indicate the correctness of the program. In this work, we propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the CodeLM is correct, based on the natural language input, the program itself, and its execution results. The sampled programs are reranked by combining the verification score with the CodeLM generation probability, and marginalizing over programs with the same execution results. On four datasets across the domains of table QA, math QA and basic Python programming, LEVER consistently improves over the base CodeLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art results on all of them.
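To make the reranking step concrete, below is a minimal Python sketch of the procedure described above: each candidate program's generation probability is multiplied by the verifier's probability of correctness, the joint scores are marginalized over candidates that produce the same execution result, and the top program from the best-scoring group is returned. The dictionary fields (logprob, exec_result, p_verifier) and the lever_rerank helper are illustrative assumptions, not identifiers from the paper's released code.

    from collections import defaultdict
    from math import exp

    def lever_rerank(samples):
        """Rerank sampled programs by combining generation probability with the
        verifier score, marginalizing over programs with the same execution result."""
        # Joint score of a single program: generation probability x verifier probability.
        def joint_score(s):
            return exp(s["logprob"]) * s["p_verifier"]

        # Marginalize: sum joint scores over programs whose execution results agree.
        score_by_result = defaultdict(float)
        for s in samples:
            score_by_result[s["exec_result"]] += joint_score(s)

        # Pick the highest-scoring program from the best execution-result group.
        best_result = max(score_by_result, key=score_by_result.get)
        candidates = [s for s in samples if s["exec_result"] == best_result]
        return max(candidates, key=joint_score)

    # Toy usage: two candidates that execute to the same value reinforce each other.
    samples = [
        {"program": "df['col'].sum()", "logprob": -1.2, "exec_result": "42", "p_verifier": 0.9},
        {"program": "sum(df['col'])",  "logprob": -1.5, "exec_result": "42", "p_verifier": 0.8},
        {"program": "df['col'].max()", "logprob": -0.9, "exec_result": "7",  "p_verifier": 0.3},
    ]
    print(lever_rerank(samples)["program"])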