Machine translation has recently achieved impressive performance thanks to advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource language pairs, yet these still require tens of thousands of parallel sentences. In this work, we take this research direction to the extreme and investigate whether it is possible to learn to translate even without any parallel data. We propose a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space. By learning to reconstruct in both languages from this shared feature space, the model effectively learns to translate without using any labeled data. We demonstrate our model on two widely used datasets and two language pairs, reporting BLEU scores of 32.8 and 15.1 on the Multi30k and WMT English-French datasets, without using even a single parallel sentence at training time.
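The core idea above (one shared encoder mapping sentences from both languages into a common latent space, with a per-language decoder trained by reconstruction on monolingual data only) can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's architecture: it uses linear encoders/decoders over toy fixed-size vectors in place of a sequence-to-sequence model, synthetic "monolingual corpora", and plain squared-error reconstruction; all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LATENT = 20, 5  # toy sizes (hypothetical, not from the paper)

# Toy stand-in for monolingual corpora: both "languages" express the same
# underlying content h through language-specific mixing matrices A.
A = {lang: rng.normal(0.0, 0.3, (VOCAB, LATENT)) for lang in ("src", "tgt")}
data = {lang: [A[lang] @ rng.normal(size=LATENT) for _ in range(50)]
        for lang in ("src", "tgt")}

# One SHARED encoder E, one decoder per language: sentences from both
# languages are mapped into the same latent space.
E = rng.normal(0.0, 0.1, (LATENT, VOCAB))
D = {lang: rng.normal(0.0, 0.1, (VOCAB, LATENT)) for lang in ("src", "tgt")}

def recon_loss(x, lang):
    """Squared reconstruction error ||D_lang(E x) - x||^2."""
    err = D[lang] @ (E @ x) - x
    return float(err @ err)

def train_step(x, lang, lr=0.02):
    """One gradient-descent step on the reconstruction loss."""
    global E
    z = E @ x                              # encode into the shared space
    err = D[lang] @ z - x                  # decode back into the same language
    grad_D = np.outer(err, z)              # d loss / d D[lang] (up to a factor of 2)
    grad_E = np.outer(D[lang].T @ err, x)  # d loss / d E
    D[lang] -= lr * grad_D
    E -= lr * grad_E

langs = ("src", "tgt")
initial = {l: np.mean([recon_loss(x, l) for x in data[l]]) for l in langs}
for _ in range(300):
    for lang in langs:
        for x in data[lang]:
            train_step(x, lang)
final = {l: np.mean([recon_loss(x, l) for x in data[l]]) for l in langs}
```

Reconstruction in both languages improves while sharing a single encoder. The full method additionally aligns the two latent distributions (e.g. with an adversarial objective) and uses denoising and back-translation losses so that decoding a source sentence with the target decoder yields a translation; those components are omitted from this sketch.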