Code generation models based on the pre-training and fine-tuning paradigm have been increasingly explored by both academia and industry, resulting in well-known industrial models such as Codex, CodeGen, and PanGu-Coder. To validate the performance of these models, multiple benchmarks (e.g., AiXBench and HumanEval) have been proposed, but they include only cases of generating standalone functions, i.e., functions that invoke or access only built-in functions and standard libraries. However, standalone functions constitute only about 30\% of functions in real open-source projects. To assess a model's performance on pragmatic code generation (i.e., code generation for real settings of open-source or proprietary code), in this paper, we propose CoderEval, a benchmark for pragmatic code generation with generative pre-trained models. Compared with the widely used HumanEval benchmark from OpenAI, CoderEval can be used to assess the performance of models on pragmatic code generation beyond just generating standalone functions. Through the evaluation of three publicly available models (CodeGen, PanGu-Coder, and Codex) on CoderEval, we analyze and discuss the current progress and future directions of pragmatic code generation with generative pre-trained models.