We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into corresponding data in the target language. Using these benchmarks, we assess the performance of code generation models in a multi-lingual fashion and find that language models generalize to out-of-domain languages, that multi-lingual models hold advantages over mono-lingual models, that few-shot prompting can teach a model new languages, and that models exhibit zero-shot translation abilities even in mono-lingual settings. Furthermore, we use our code generation model to perform large-scale bootstrapping, obtaining synthetic canonical solutions in several languages that can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks. Overall, our benchmarks represent a significant step towards a deeper understanding of language models' code generation abilities. We publicly release our code and datasets at https://github.com/amazon-research/mxeval.
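As an illustration of the kind of conversion such a framework performs, the sketch below maps a toy Python problem specification (function signature, docstring, and assertion-style tests) to a Java prompt and test harness. This is a minimal, hypothetical sketch, not the paper's actual implementation: the `PythonProblem` structure, `TYPE_MAP`, and rendering helpers are assumptions introduced for illustration.

```python
# Minimal sketch (not the authors' implementation) of MBXP-style prompt
# transpilation: a Python function-signature prompt plus assertion tests
# is mapped to an equivalent Java prompt and test harness.
from dataclasses import dataclass

# Hypothetical mapping from Python type hints to Java types (primitives only,
# so the generated tests can compare results with `!=`).
TYPE_MAP = {"int": "int", "bool": "boolean", "float": "double"}


@dataclass
class PythonProblem:
    name: str                       # Python function name, e.g. "add_two"
    params: list[tuple[str, str]]   # (parameter name, Python type)
    return_type: str                # Python type of the return value
    docstring: str                  # natural-language task description
    tests: list[tuple[str, str]]    # (call arguments, expected literal)


def camel_case(snake: str) -> str:
    """Convert a snake_case Python name to a camelCase Java name."""
    head, *rest = snake.split("_")
    return head + "".join(word.title() for word in rest)


def to_java_prompt(p: PythonProblem) -> str:
    """Render the Java method-signature prompt, with the docstring as Javadoc."""
    args = ", ".join(f"{TYPE_MAP[t]} {n}" for n, t in p.params)
    return (
        "class Solution {\n"
        f"    /**\n     * {p.docstring}\n     */\n"
        f"    public static {TYPE_MAP[p.return_type]} {camel_case(p.name)}({args}) {{\n"
    )


def to_java_tests(p: PythonProblem) -> str:
    """Render the Python assertions as a simple Java main() test harness."""
    checks = "\n".join(
        f"        if (Solution.{camel_case(p.name)}({args}) != {expected}) "
        "throw new AssertionError();"
        for args, expected in p.tests
    )
    return (
        "class Main {\n"
        "    public static void main(String[] args) {\n"
        f"{checks}\n"
        "    }\n"
        "}\n"
    )


if __name__ == "__main__":
    problem = PythonProblem(
        name="add_two",
        params=[("x", "int"), ("y", "int")],
        return_type="int",
        docstring="Return the sum of x and y.",
        tests=[("1, 2", "3"), ("0, 0", "0")],
    )
    print(to_java_prompt(problem))
    print(to_java_tests(problem))
```

Because only the prompt and test cases are converted, an evaluation harness of this form can score generated code by execution (pass/fail of the target-language tests) without requiring hand-written canonical solutions in every language.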