Software engineers working with the same programming language (PL) may speak different natural languages (NLs), and vice versa, which raises substantial barriers to communication and productivity. Recent studies have demonstrated the effectiveness of generative pre-training on computer programs, yet they are largely English-centric. In this work, we take a step toward bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling, which learns patterns from monolingual NL or PL data; and pivot-based translation language modeling, which relies on parallel data covering many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of code-intelligence end tasks, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage in zero-shot prompting for multilingual code summarization and text-to-text translation. We will make our code and pre-trained models publicly available.
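To make the first pre-training objective concrete, below is a minimal sketch of T5-style span corruption applied to a tokenized snippet (of either NL text or source code). This is an illustrative assumption about the general technique, not the paper's exact implementation: the function name `span_corrupt`, the corruption rate, and the mean span length are hypothetical choices for the example; the `<extra_id_*>` sentinel convention follows common T5-style practice.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """T5-style span corruption: replace contiguous spans with sentinel tokens.

    Returns (input_tokens, target_tokens). Each corrupted span in the input is
    replaced by a sentinel such as <extra_id_0>; the target lists each sentinel
    followed by the original tokens of that span.
    """
    rng = random.Random(seed)
    n = len(tokens)
    num_to_mask = max(1, int(n * corruption_rate))
    masked = set()
    # Sample span start positions until enough token positions are covered.
    while len(masked) < num_to_mask:
        start = rng.randrange(n)
        length = max(1, int(rng.expovariate(1 / mean_span_len)))
        masked.update(range(start, min(n, start + length)))

    inputs, targets = [], []
    sentinel_id = 0
    i = 0
    while i < n:
        if i in masked:
            sentinel = f"<extra_id_{sentinel_id}>"
            sentinel_id += 1
            inputs.append(sentinel)
            targets.append(sentinel)
            # Copy the whole contiguous masked span into the target.
            while i < n and i in masked:
                targets.append(tokens[i])
                i += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append("</s>")
    return inputs, targets


# Example: corrupting a small code snippet treated as a token sequence.
code_tokens = "def add ( a , b ) : return a + b".split()
inp, tgt = span_corrupt(code_tokens)
print(inp)  # e.g. ['def', 'add', '(', 'a', ',', 'b', ')', ':', '<extra_id_0>', 'a', '+', 'b']
print(tgt)  # e.g. ['<extra_id_0>', 'return', '</s>']
```

The second objective, pivot-based translation language modeling, instead concatenates parallel pairs (e.g., a non-English NL sentence with its English translation, or an English docstring with its code) and trains the model to recover masked spans across the pair, using English as the pivot between multilingual NLs and PLs.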