We present CoTexT, a pre-trained, transformer-based encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL). Using self-supervision, CoTexT is pre-trained on large programming language corpora to learn a general understanding of language and code. CoTexT supports downstream NL-PL tasks such as code summarization/documentation, code generation, defect detection, and code debugging. We train CoTexT on different combinations of available PL corpora, including both "bimodal" and "unimodal" data. Here, bimodal data consists of text paired with corresponding code snippets, whereas unimodal data consists of code snippets only. We first evaluate CoTexT with multi-task learning: we perform Code Summarization on six different programming languages and Code Refinement on both the small and medium-sized datasets featured in the CodeXGLUE benchmark. We further conduct extensive experiments to investigate CoTexT on other tasks within the CodeXGLUE benchmark, including Code Generation and Defect Detection. We consistently achieve state-of-the-art (SOTA) results on these tasks, demonstrating the versatility of our model.
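To make the encoder-decoder setup concrete, the following is a minimal sketch (not taken from the paper) of how a fine-tuned T5-style model such as CoTexT could be applied to code summarization through the Hugging Face transformers API. The checkpoint name "cotext-checkpoint" is a placeholder, not an official model identifier.

```python
# Hypothetical usage sketch: a seq2seq model like CoTexT reads a code snippet
# and generates a natural-language summary. "cotext-checkpoint" is a placeholder.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("cotext-checkpoint")      # placeholder path
model = AutoModelForSeq2SeqLM.from_pretrained("cotext-checkpoint")  # placeholder path

code_snippet = "def add(a, b):\n    return a + b"

# The encoder consumes the code tokens; the decoder generates the summary.
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_length=48, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```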