While most machine translation systems to date are trained on large parallel corpora, humans learn language in a different way: by being grounded in an environment and interacting with other humans. In this work, we propose a communication game where two agents, native speakers of their own respective languages, jointly learn to solve a visual referential task. We find that the ability to understand and translate a foreign language emerges as a means to achieve shared goals. The emergent translation is interactive and multimodal, and crucially does not require parallel corpora, but only monolingual, independent text and corresponding images. Our proposed translation model achieves this by grounding the source and target languages into a shared visual modality, and outperforms several baselines on both word-level and sentence-level translation tasks. Furthermore, we show that agents in a multilingual community learn to translate better and faster than in a bilingual communication setting.
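Below is a minimal sketch of the kind of visual referential game the abstract describes: two text encoders (one per language) and an image encoder are trained, from monolingual image-caption pairs only, to place captions and their images close together in a shared visual space; translation then amounts to retrieval through that space. All module names, dimensions, and hyperparameters here are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM, HID_DIM, IMG_DIM = 64, 128, 2048  # assumed sizes

class TextEncoder(nn.Module):
    """Encodes a caption (token ids) into the shared visual space."""
    def __init__(self, vocab_size):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, EMB_DIM)
        self.rnn = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)
        self.proj = nn.Linear(HID_DIM, HID_DIM)

    def forward(self, tokens):
        _, h = self.rnn(self.emb(tokens))      # h: (1, B, HID_DIM)
        return F.normalize(self.proj(h[-1]), dim=-1)

class ImageEncoder(nn.Module):
    """Projects pre-extracted image features into the same space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(IMG_DIM, HID_DIM)

    def forward(self, feats):
        return F.normalize(self.proj(feats), dim=-1)

def referential_loss(text_enc, img_enc, tokens, images):
    """Each caption must pick out its own image among the batch (the distractors)."""
    t = text_enc(tokens)                       # (B, HID_DIM)
    v = img_enc(images)                        # (B, HID_DIM)
    scores = t @ v.t()                         # caption-to-image similarities
    target = torch.arange(t.size(0))           # i-th caption matches i-th image
    return F.cross_entropy(scores, target)

if __name__ == "__main__":
    # Toy usage: one gradient step per language on its own monolingual data.
    en_enc, de_enc, img_enc = TextEncoder(1000), TextEncoder(1000), ImageEncoder()
    opt = torch.optim.Adam([*en_enc.parameters(), *de_enc.parameters(),
                            *img_enc.parameters()], lr=1e-3)
    en_caps = torch.randint(0, 1000, (8, 12))  # fake source-language captions
    de_caps = torch.randint(0, 1000, (8, 12))  # fake target-language captions (different images)
    imgs_a, imgs_b = torch.randn(8, IMG_DIM), torch.randn(8, IMG_DIM)
    loss = referential_loss(en_enc, img_enc, en_caps, imgs_a) + \
           referential_loss(de_enc, img_enc, de_caps, imgs_b)
    loss.backward()
    opt.step()
    # Translation in this sketch: encode a source sentence, then retrieve the
    # nearest image or target-language caption in the shared visual space.
```

Note that no parallel sentence pairs enter the loss above; the two languages are tied together only through the shared image encoder, which is the grounding mechanism the abstract refers to.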