Software engineers working with the same programming language (PL) may speak different natural languages (NLs), and vice versa, which raises substantial barriers to communication and productivity. Recent studies have demonstrated the effectiveness of generative pre-training on computer programs, yet they are largely English-centric. In this work, we take a step toward bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling, which learns patterns from monolingual NL or PL data; and pivot-based translation language modeling, which relies on parallel data covering many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of code-intelligence end tasks, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage in zero-shot prompting for multilingual code summarization and text-to-text translation. We will make our code and pre-trained models publicly available.
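To make the first pre-training objective concrete, below is a minimal sketch of T5-style span corruption applied to a tokenized snippet (of either NL text or source code). This is an illustrative assumption about the general technique, not the paper's exact implementation: the function name `span_corrupt`, the corruption rate, and the mean span length are hypothetical choices for the example; the `<extra_id_*>` sentinel convention follows common T5-style practice.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """T5-style span corruption: replace contiguous spans with sentinel tokens.

    Returns (input_tokens, target_tokens). Each corrupted span in the input is
    replaced by a sentinel such as <extra_id_0>; the target lists each sentinel
    followed by the original tokens of that span.
    """
    rng = random.Random(seed)
    n = len(tokens)
    num_to_mask = max(1, int(n * corruption_rate))
    masked = set()
    # Sample span start positions until enough token positions are covered.
    while len(masked) < num_to_mask:
        start = rng.randrange(n)
        length = max(1, int(rng.expovariate(1 / mean_span_len)))
        masked.update(range(start, min(n, start + length)))

    inputs, targets = [], []
    sentinel_id = 0
    i = 0
    while i < n:
        if i in masked:
            sentinel = f"<extra_id_{sentinel_id}>"
            sentinel_id += 1
            inputs.append(sentinel)
            targets.append(sentinel)
            # Copy the whole contiguous masked span into the target.
            while i < n and i in masked:
                targets.append(tokens[i])
                i += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append("</s>")
    return inputs, targets


# Example: corrupting a small code snippet treated as a token sequence.
code_tokens = "def add ( a , b ) : return a + b".split()
inp, tgt = span_corrupt(code_tokens)
print(inp)  # e.g. ['def', 'add', '(', 'a', ',', 'b', ')', ':', '<extra_id_0>', 'a', '+', 'b']
print(tgt)  # e.g. ['<extra_id_0>', 'return', '</s>']
```

The second objective, pivot-based translation language modeling, instead concatenates parallel pairs (e.g., a non-English NL sentence with its English translation, or an English docstring with its code) and trains the model to recover masked spans across the pair, using English as the pivot between multilingual NLs and PLs.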