We present MBXP, an execution-based code completion benchmark in 10+ programming languages. This collection of datasets is generated by our conversion framework, which translates prompts and test cases from the original MBPP dataset into the corresponding data in each target language. This benchmark enables us to evaluate code generation models in a multi-lingual fashion and, in particular, to study the generalization ability of language models on out-of-domain languages, the advantages of large multi-lingual models over mono-lingual ones, the benefits of few-shot prompting, and zero-shot translation abilities. In addition, we use our code generation model to perform large-scale bootstrapping, obtaining synthetic canonical solutions in several languages. These solutions can be used for other code-related evaluations, such as insertion-based, summarization, or code translation tasks, for which we demonstrate results and which we release as part of our benchmark.
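To make the execution-based evaluation concrete, below is a minimal sketch of how a benchmark in the style of MBXP can score a model completion: the function-signature prompt, the model-generated body, and the converted test cases are concatenated and run in a subprocess, and the problem counts as solved if the program exits cleanly. All names here (evaluate_completion, the sample problem) are illustrative assumptions, not the benchmark's actual API.

    import subprocess
    import sys
    import tempfile
    import textwrap

    def evaluate_completion(prompt: str, completion: str, test_code: str,
                            timeout: float = 10.0) -> bool:
        """Concatenate prompt + completion + converted tests, execute the
        result in a subprocess, and report pass/fail (hypothetical helper)."""
        program = prompt + completion + "\n" + test_code
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        try:
            # Exit code 0 means every assert in the converted tests passed.
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout)
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False

    # An MBPP-style problem; the conversion framework would emit the analogous
    # prompt and asserts in each target language (e.g., Java, Kotlin, Ruby).
    prompt = textwrap.dedent('''\
        def remove_odd(lst):
            """Remove the odd numbers from a given list."""
    ''')
    completion = "    return [x for x in lst if x % 2 == 0]\n"
    tests = "assert remove_odd([1, 2, 3, 4]) == [2, 4]\n"
    print(evaluate_completion(prompt, completion, tests))  # True if tests pass

The same pass/fail protocol applies in every target language; only the prompt syntax, the test-case syntax, and the execution command change per language.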