In this paper, we leverage low-level compiler intermediate representations (IR) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs for which a natural-looking translation can be obtained. However, these approaches treat code as sequences of text tokens and still fail to distinguish between similar pieces of code that have different semantics in different languages. The consequence is low-quality translations, which reduces the practicality of NMT and underscores the need for approaches that significantly increase its accuracy. Here we propose to augment code translation with IRs, specifically LLVM IR, and report results on the C++, Java, Rust, and Go languages. Our method improves upon the state of the art for unsupervised code translation, increasing the number of correct translations by 11% on average, and by up to 79% for the Java -> Rust pair with greedy decoding. With beam search, it increases the number of correct translations by 5.5% on average. We extend previous test sets for code translation by adding hundreds of Go and Rust functions. Additionally, we train models that perform well on IR decompilation, i.e., generating source code from IR, and study the use of IRs as an intermediate pivot for translation.
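To make the role of the IR concrete, here is a minimal illustration (not taken from the paper): a trivial C++ function together with the kind of LLVM IR a compiler such as `clang -S -emit-llvm` would emit for it. The IR spells out types and operations explicitly, which is the language-agnostic semantic signal the translation model can exploit; the exact IR below is simplified (real output includes name mangling and attributes).

```cpp
// Source function in C++.
int add(int a, int b) {
    return a + b;
}

// Simplified LLVM IR for the function above (illustrative, lightly optimized):
//   define i32 @add(i32 %a, i32 %b) {
//     %sum = add nsw i32 %a, %b
//     ret i32 %sum
//   }
```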