Large pre-trained code generation models, such as OpenAI Codex, can generate syntactically and functionally correct code, making programmers more productive and bringing our pursuit of artificial general intelligence a step closer. In this paper, we introduce CodeGeeX, a multilingual code generation model with 13 billion parameters. As of June 2022, CodeGeeX has been pre-trained on 850 billion tokens spanning 23 programming languages. Building upon HumanEval (Python only), we develop the HumanEval-X benchmark for evaluating multilingual models by hand-writing solutions in C++, Java, JavaScript, and Go. Our extensive experiments suggest that CodeGeeX outperforms multilingual code models of similar scale on HumanEval-X for both code generation and code translation. In addition, we build CodeGeeX-based extensions for Visual Studio Code, JetBrains, and Cloud Studio, which generate 4.7 billion tokens per week for tens of thousands of active users. Our user study demonstrates that CodeGeeX helps 83.4% of its users code more efficiently. Finally, CodeGeeX is publicly accessible, and in September 2022 we open-sourced its code, model weights (the version trained on 850B tokens), API, extensions, and HumanEval-X at https://github.com/THUDM/CodeGeeX.
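For context on how HumanEval-style benchmarks such as HumanEval-X score models: each generated sample is executed against hand-written unit tests, and results are reported with the unbiased pass@k estimator introduced alongside HumanEval (Chen et al., 2021). Below is a minimal Python sketch of that estimator; the function name and the example numbers are illustrative, not taken from the paper:

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n generated samples per problem,
    of which c pass all unit tests, returns 1 - C(n-c, k) / C(n, k),
    the probability that at least one of k randomly drawn samples is correct."""
    if n - c < k:
        return 1.0  # too few failing samples: every size-k draw contains a correct one
    # Expand the binomial ratio as a product to avoid huge factorials.
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))


# Illustrative numbers only: 200 samples per problem, 37 of them passing.
print(round(pass_at_k(200, 37, 1), 4))    # 0.185 -> pass@1 reduces to c/n
print(round(pass_at_k(200, 37, 100), 4))  # pass@100 is far higher with a large budget
```

Averaging this quantity over all benchmark problems gives the pass@k numbers typically reported for HumanEval and HumanEval-X.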