While prior work has established that the use of parallel data is conducive to cross-lingual learning, it remains unclear whether the improvements come from the data itself or from the modeling of parallel interactions. To explore this question, we examine the use of unsupervised machine translation to generate synthetic parallel data, and compare it to supervised machine translation and gold parallel data. We find that even model-generated parallel data can be useful for downstream tasks, both in a general setting (continued pretraining) and in a task-specific setting (translate-train), although our best results are still obtained with real parallel data. Our findings suggest that existing multilingual models do not exploit the full potential of monolingual data, and prompt the community to reconsider the traditional categorization of cross-lingual learning approaches.
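For readers unfamiliar with the translate-train setting referenced above, the following is a minimal sketch, not the paper's pipeline: it assumes a pretrained Marian MT model from HuggingFace transformers (Helsinki-NLP/opus-mt-en-fr, chosen here purely for illustration) as the translator, and a toy NLI-style example. In the paper's experiments, the translations would instead come from an unsupervised MT system, a supervised MT system, or gold parallel data.

```python
from transformers import pipeline

# Translate-train sketch: translate the English training data into the
# target language, then fine-tune the classifier on the translated copy.
# The MT model below is an illustrative choice, not the paper's system.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

# Toy NLI-style training example (hypothetical data).
train_examples = [
    {"premise": "A man is playing a guitar.",
     "hypothesis": "A person is making music.",
     "label": 0},
]

translated_examples = []
for ex in train_examples:
    translated_examples.append({
        "premise": translator(ex["premise"])[0]["translation_text"],
        "hypothesis": translator(ex["hypothesis"])[0]["translation_text"],
        "label": ex["label"],  # labels carry over unchanged
    })

# translated_examples would then be used to fine-tune a multilingual
# model directly in the target language.
```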