The emergent cross-lingual transfer seen in multilingual pretrained models has sparked significant interest in studying their behavior. However, because these analyses have focused on fully trained multilingual models, little is known about the dynamics of the multilingual pretraining process. We investigate when these models acquire their in-language and cross-lingual abilities by probing checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks. Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones. In contrast, when the model learns to transfer cross-lingually depends on the language pair. Interestingly, we also observe that, across many languages and tasks, the final, converged model checkpoint exhibits significant performance degradation and that no one checkpoint performs best on all languages. Taken together with our other findings, these insights highlight the complexity and interconnectedness of multilingual pretraining.
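The probing setup described above can be illustrated with a minimal sketch: freeze each saved pretraining checkpoint, extract sentence representations, and fit a lightweight classifier per linguistic task. The checkpoint paths and the task data here are hypothetical placeholders and the paper's actual probing suite and XLM-R checkpoints are not reproduced; this only shows the general frozen-encoder, linear-probe pattern, assuming a HuggingFace-compatible checkpoint.

```python
# Sketch of checkpoint probing: freeze one pretraining checkpoint, mean-pool its
# representations, and fit a linear probe. Checkpoint path and data are hypothetical.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

def probe_checkpoint(checkpoint_path, train_sents, train_labels, test_sents, test_labels):
    """Fit a linear probe on frozen sentence representations from one checkpoint."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
    model = AutoModel.from_pretrained(checkpoint_path).eval()

    def encode(sentences):
        with torch.no_grad():
            batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
            hidden = model(**batch).last_hidden_state            # (batch, seq, dim)
            mask = batch["attention_mask"].unsqueeze(-1).float() # zero out padding
            return ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # mean pooling

    probe = LogisticRegression(max_iter=1000)
    probe.fit(encode(train_sents), train_labels)
    return probe.score(encode(test_sents), test_labels)

# Repeating this over checkpoints saved throughout pretraining (paths hypothetical)
# traces when each in-language or cross-lingual ability emerges.
```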