What can pre-trained multilingual sequence-to-sequence models like mBART contribute to translating low-resource languages? We conduct a thorough empirical experiment in 10 languages to ascertain this, considering five factors: (1) the amount of fine-tuning data, (2) the noise in the fine-tuning data, (3) the amount of pre-training data in the model, (4) the impact of domain mismatch, and (5) language typology. In addition to yielding several heuristics, the experiments form a framework for evaluating the data sensitivities of machine translation systems. While mBART is robust to domain differences, its translations for unseen and typologically distant languages remain below 3.0 BLEU. In answer to our title's question, mBART is not a low-resource panacea; we therefore encourage shifting the emphasis from new models to new data.
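Below is a minimal, illustrative sketch of the kind of experiment the abstract describes: fine-tuning a pre-trained mBART checkpoint on a small parallel corpus and scoring the output with corpus BLEU. It assumes the HuggingFace `facebook/mbart-large-50` checkpoint, a toy Sinhala–English bitext, and default hyperparameters; the paper's actual data, language pairs, and training setup are not reproduced here.

```python
# Sketch only: fine-tune mBART-50 on a tiny parallel sample and report BLEU.
# Model name, language codes, and data are illustrative assumptions.
import sacrebleu
from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

MODEL = "facebook/mbart-large-50"   # pre-trained multilingual seq2seq model
SRC, TGT = "si_LK", "en_XX"         # illustrative low-resource pair (assumption)

tokenizer = MBart50TokenizerFast.from_pretrained(MODEL, src_lang=SRC, tgt_lang=TGT)
model = MBartForConditionalGeneration.from_pretrained(MODEL)

# Toy parallel corpus standing in for the fine-tuning data whose size and noise
# the paper varies; replace with a real bitext for a meaningful experiment.
train = Dataset.from_dict({
    "src": ["මෙය පරීක්ෂණයකි.", "පොත මේසය මත ඇත."],
    "tgt": ["This is a test.", "The book is on the table."],
})

def preprocess(batch):
    # Tokenize source and target sides jointly for seq2seq training.
    return tokenizer(batch["src"], text_target=batch["tgt"],
                     truncation=True, max_length=128)

train_tok = train.map(preprocess, batched=True, remove_columns=["src", "tgt"])

args = Seq2SeqTrainingArguments(
    output_dir="mbart-lowres-sketch",
    per_device_train_batch_size=2,
    num_train_epochs=1,             # tiny settings purely for illustration
    learning_rate=3e-5,
    logging_steps=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Translate held-out source sentences and score with corpus BLEU (sacrebleu),
# the same kind of measurement behind the "below 3.0 BLEU" observation.
test_src = ["මෙය පරීක්ෂණයකි."]
test_ref = ["This is a test."]
inputs = tokenizer(test_src, return_tensors="pt", padding=True)
generated = model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id[TGT], max_length=64
)
hyps = tokenizer.batch_decode(generated, skip_special_tokens=True)
print(sacrebleu.corpus_bleu(hyps, [test_ref]).score)
```

Varying the size of `train`, injecting noise into it, or choosing a source language outside mBART's pre-training set would correspond to the factors the study manipulates.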