Cross-lingual transfer learning without labeled target-language data or parallel text has been surprisingly effective for zero-shot cross-lingual classification, question answering, unsupervised machine translation, and related tasks. However, several recent publications have claimed that domain mismatch prevents cross-lingual transfer: their results show that unsupervised bilingual lexicon induction (UBLI) and unsupervised neural machine translation (UNMT) perform poorly when the underlying monolingual corpora come from different domains (e.g., French text from Wikipedia but English text from UN proceedings). In this work, we show that a simple initialization regimen can overcome much of the effect of domain mismatch in cross-lingual transfer. We pre-train word and contextual embeddings on the concatenated domain-mismatched corpora and use them as initializations for three tasks: MUSE UBLI, UN Parallel UNMT, and the SemEval 2017 cross-lingual word similarity task. In all cases, our results challenge the conclusions of prior work by showing that proper initialization can recover a large portion of the performance lost to domain mismatch.
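As a rough illustration of the initialization step described above, the following is a minimal sketch (not the authors' released code) of jointly pre-training subword embeddings on the concatenation of two domain-mismatched monolingual corpora and exporting them as an initialization for a downstream tool such as MUSE. The file names, hyperparameters, and the choice of gensim's FastText are assumptions made for the example.

```python
# Minimal sketch: jointly pre-train embeddings on the concatenation of
# domain-mismatched monolingual corpora, then export them as an
# initialization for downstream cross-lingual tasks (e.g., MUSE UBLI).
# File names and hyperparameters are illustrative assumptions.

from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# Hypothetical whitespace-tokenized corpora, one sentence per line,
# drawn from mismatched domains as in the abstract's example.
FR_WIKI = "fr.wiki.tok.txt"  # French Wikipedia text
EN_UN = "en.un.tok.txt"      # English UN-proceedings text


class ConcatCorpus:
    """Restartable stream over the concatenated corpora.

    gensim iterates the corpus more than once (vocabulary build, then
    training), so __iter__ must be callable repeatedly.
    """

    def __init__(self, *paths):
        self.paths = paths

    def __iter__(self):
        for path in self.paths:
            yield from LineSentence(path)


# Pre-train subword-aware skip-gram embeddings over both corpora at once,
# producing a single shared vector space and vocabulary.
model = FastText(
    sentences=ConcatCorpus(FR_WIKI, EN_UN),
    vector_size=300,  # common dimensionality for MUSE-style alignment
    sg=1,             # skip-gram
    min_count=5,
    epochs=5,
    workers=4,
)

# Export in word2vec text format so downstream tools can load the vectors
# as their initialization instead of training from scratch per domain.
model.wv.save_word2vec_format("concat_init.vec")
```

Pre-training the contextual embeddings mentioned in the abstract would follow the analogous recipe on the same concatenated text, with a masked language model in place of fastText.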