Code summarization and generation enable conversion between programming language (PL) and natural language (NL), while code translation supports the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in the English language, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART's effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming conventions), and logical flow (e.g., an if block inside an else block is equivalent to an else if block), all of which are crucial to program semantics, and thus excels even with limited annotations.
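To make the denoising autoencoding objective mentioned in the abstract more concrete, the following is a minimal sketch of how a (noisy input, original target) training pair might be built. The abstract does not specify the noising functions, so this sketch assumes simple random token masking; the names `MASK`, `mask_tokens`, and `make_denoising_pair`, the whitespace tokenizer, and the 0.35 masking rate are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical mask symbol; the abstract does not name the one PLBART uses.
MASK = "<mask>"


def mask_tokens(tokens, mask_prob=0.35, seed=0):
    """Return a copy of `tokens` in which each token is independently
    replaced by MASK with probability `mask_prob` (an assumed rate)."""
    rng = random.Random(seed)
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def make_denoising_pair(code, mask_prob=0.35, seed=0):
    """Build one (noisy source, clean target) pair.

    A sequence-to-sequence model trained by denoising autoencoding reads
    the noisy source and learns to reconstruct the clean target.
    Whitespace tokenization is a simplification for this sketch.
    """
    tokens = code.split()
    noisy = mask_tokens(tokens, mask_prob, seed)
    return " ".join(noisy), " ".join(tokens)


source, target = make_denoising_pair("def add ( a , b ) : return a + b")
print("input :", source)
print("target:", target)
```

The target is always the unmodified sequence, so no labels are required beyond the raw functions and text themselves, which is what allows pre-training on a large unlabeled collection.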