Large language models have demonstrated the ability to condition on and generate both natural language and programming language text. Such models open up the possibility of multi-language code generation: could code generation models generalize knowledge from one language to another? Although contemporary code generation models can generate semantically correct Python code, little is known about their abilities with other languages. We facilitate the exploration of this topic by proposing MultiPL-E, the first multi-language parallel benchmark for natural-language-to-code generation. MultiPL-E extends the HumanEval benchmark (Chen et al., 2021) to support 18 more programming languages, encompassing a range of programming paradigms and levels of popularity. We evaluate two state-of-the-art code generation models on MultiPL-E: Codex and InCoder. We find that on several languages, Codex matches and even exceeds its performance on Python. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible. We describe a general approach for easily adding support for new benchmarks and languages to MultiPL-E.
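To make the idea of "compiling" a benchmark to a new language concrete, the sketch below re-renders a HumanEval-style typed Python signature and docstring as a prompt stub in another language. The `to_typescript_prompt` helper and its type-translation table are illustrative assumptions for exposition, not MultiPL-E's actual implementation.

```python
# A minimal sketch of the benchmark-compilation idea: re-rendering a
# HumanEval-style Python prompt as a prompt in another language.
# The helper name and translation rules are illustrative assumptions,
# not the actual MultiPL-E implementation.

# Hypothetical mapping from Python type hints to TypeScript types.
TS_TYPES = {"int": "number", "str": "string", "bool": "boolean"}

def to_typescript_prompt(name: str, params: list[tuple[str, str]],
                         ret: str, docstring: str) -> str:
    """Render a typed Python signature and docstring as a TypeScript
    function stub, ready to be completed by a code generation model."""
    args = ", ".join(f"{p}: {TS_TYPES[t]}" for p, t in params)
    doc = "\n".join(f" * {line}" for line in docstring.splitlines())
    return (f"/**\n{doc}\n */\n"
            f"function {name}({args}): {TS_TYPES[ret]} {{\n")

if __name__ == "__main__":
    # Produces an open TypeScript function stub for the model to complete.
    print(to_typescript_prompt(
        "add", [("x", "int"), ("y", "int")], "int",
        "Return the sum of x and y."))
```

Because the per-language logic reduces to a small set of rendering rules like these, extending such a compiler with a new target language or a new source benchmark is mostly mechanical, which is what makes the approach scalable.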