This paper explores the limits of the current generation of large language models for program synthesis in general-purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23,914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without fine-tuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.
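To make the task format concrete, the sketch below shows an MBPP-style problem and how a few-shot prompt could be assembled from solved examples. MBPP tasks pair a natural language description and a reference solution with assert-based test cases; the dict fields, helper function, and prompt layout here are illustrative assumptions for exposition, not the paper's exact prompt.

# Illustrative MBPP-style task and few-shot prompt assembly (Python).
# The field names and prompt layout are assumptions; the paper's
# actual prompt wording may differ.

EXAMPLE_TASKS = [
    {
        "text": "Write a function to find the minimum of two numbers.",
        "code": "def min_of_two(a, b):\n    return a if a < b else b",
        "tests": [
            "assert min_of_two(1, 2) == 1",
            "assert min_of_two(-5, 3) == -5",
            "assert min_of_two(0, 0) == 0",
        ],
    },
]

def build_few_shot_prompt(solved_examples, target_task):
    """Concatenate a few solved (description, tests, solution) triples,
    then the target description and its tests, leaving the solution
    for the model to complete."""
    parts = []
    for ex in solved_examples:
        parts.append(ex["text"])
        parts.extend(ex["tests"])
        parts.append(ex["code"])
    parts.append(target_task["text"])
    parts.extend(target_task["tests"])  # tests pin down the expected function signature
    return "\n".join(parts)

A sampled completion is then judged by functional correctness: it counts as a solution only if it passes the task's test cases when executed.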