With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single misplaced token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered dataset significantly reduces the noise in the generated translations, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java $\to$ Python and Python $\to$ C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%.
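The filtering idea can be sketched as follows: each candidate translation is executed against the source function's unit tests, and only candidates passing every test are kept for the parallel corpus. This is a minimal illustrative sketch, not the paper's actual pipeline; the function and variable names (`run_unit_tests`, `candidates`) are hypothetical.

```python
def run_unit_tests(func, test_cases):
    """Return True iff func passes every (args, expected_output) pair."""
    for args, expected in test_cases:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            return False  # runtime errors count as failed translations
    return True

# Toy "candidate translations" of a source function computing |x|.
candidates = {
    "correct": lambda x: x if x >= 0 else -x,
    "buggy":   lambda x: x,       # wrong for negative inputs
    "crashes": lambda x: x / 0,   # raises at runtime
}

tests = [((3,), 3), ((-4,), 4), ((0,), 0)]

# Keep only candidates that pass all unit tests; these validated pairs
# form the fully tested parallel corpus used for fine-tuning.
filtered = {name: f for name, f in candidates.items()
            if run_unit_tests(f, tests)}
```

Here only the `"correct"` candidate survives filtering; the buggy and crashing translations are discarded rather than fed back as noisy training signal.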