With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single misplaced token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered dataset significantly reduces the noise in the generated translations, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java $\to$ Python and Python $\to$ C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%.
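The filtering idea can be sketched as follows: each candidate translation is executed against the source function's unit tests, and only candidates passing every test are kept for the parallel corpus. This is a minimal illustrative sketch, not the paper's actual pipeline; the function and variable names (`run_unit_tests`, `candidates`) are hypothetical.

```python
def run_unit_tests(func, test_cases):
    """Return True iff func passes every (args, expected_output) pair."""
    for args, expected in test_cases:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            return False  # runtime errors count as failed translations
    return True

# Toy "candidate translations" of a source function computing |x|.
candidates = {
    "correct": lambda x: x if x >= 0 else -x,
    "buggy":   lambda x: x,       # wrong for negative inputs
    "crashes": lambda x: x / 0,   # raises at runtime
}

tests = [((3,), 3), ((-4,), 4), ((0,), 0)]

# Keep only candidates that pass all unit tests; these validated pairs
# form the fully tested parallel corpus used for fine-tuning.
filtered = {name: f for name, f in candidates.items()
            if run_unit_tests(f, tests)}
```

Here only the `"correct"` candidate survives filtering; the buggy and crashing translations are discarded rather than fed back as noisy training signal.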