Large pre-trained code generation models, such as OpenAI Codex, can generate syntactically and functionally correct code, making programmers more productive and bringing our pursuit of artificial general intelligence a step closer. In this paper, we introduce CodeGeeX, a multilingual code generation model with 13 billion parameters. As of June 2022, CodeGeeX has been pre-trained on 850 billion tokens spanning 23 programming languages. Building upon HumanEval (Python only), we develop the HumanEval-X benchmark for evaluating multilingual models by hand-writing solutions in C++, Java, JavaScript, and Go. Our extensive experiments suggest that CodeGeeX outperforms multilingual code models of similar scale on HumanEval-X for both code generation and code translation. In addition, we build CodeGeeX-based extensions for Visual Studio Code, JetBrains, and Cloud Studio, which generate 4.7 billion tokens per week for tens of thousands of active users. Our user study demonstrates that CodeGeeX helps 83.4% of its users code more efficiently. Finally, CodeGeeX is publicly accessible, and in September 2022 we open-sourced its code, model weights (the version trained on 850B tokens), API, extensions, and HumanEval-X at https://github.com/THUDM/CodeGeeX.
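For context on how HumanEval-style benchmarks such as HumanEval-X score models: each generated sample is executed against hand-written unit tests, and results are reported with the unbiased pass@k estimator introduced alongside HumanEval (Chen et al., 2021). Below is a minimal Python sketch of that estimator; the function name and the example numbers are illustrative, not taken from the paper:

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n generated samples per problem,
    of which c pass all unit tests, returns 1 - C(n-c, k) / C(n, k),
    the probability that at least one of k randomly drawn samples is correct."""
    if n - c < k:
        return 1.0  # too few failing samples: every size-k draw contains a correct one
    # Expand the binomial ratio as a product to avoid huge factorials.
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))


# Illustrative numbers only: 200 samples per problem, 37 of them passing.
print(round(pass_at_k(200, 37, 1), 4))    # 0.185 -> pass@1 reduces to c/n
print(round(pass_at_k(200, 37, 100), 4))  # pass@100 is far higher with a large budget
```

Averaging this quantity over all benchmark problems gives the pass@k numbers typically reported for HumanEval and HumanEval-X.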