In this work, we provide a large-scale empirical study of the scaling properties of multilingual neural machine translation models. We examine how increases in model size affect model performance and investigate the role of the training mixture composition in the scaling behavior. We find that changing the weighting of the individual language pairs in the training mixture only affects the multiplicative factor of the scaling law. In particular, we observe that multilingual models trained with different mixing rates all exhibit the same scaling exponent. Through a novel joint scaling law formulation, we compute the effective number of parameters allocated to each language pair and examine the role of language similarity in the scaling behavior of our models. We find little evidence that language similarity has any impact. In contrast, the direction of multilinguality plays a significant role: models translating from multiple languages into English have a larger effective number of parameters per task than their reversed counterparts. Finally, we leverage our observations to predict the performance of multilingual models trained with any language weighting at any scale, significantly reducing the effort required for language balancing in large multilingual models. Our findings apply to both in-domain and out-of-domain test sets and to multiple evaluation metrics, such as ChrF and BLEURT.
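To make the shape of these claims concrete, the following is a minimal sketch of the kind of per-language-pair scaling law being described, assuming a standard power-law form; the symbols $\alpha$, $\beta_p$, $L^{(p)}_{\infty}$, $f_p$, $g_p$, and $N^{(p)}_{\mathrm{eff}}$ are illustrative assumptions rather than the paper's exact notation.
% Assumed power-law form for the test loss of language pair p at model size N:
% a shared scaling exponent \alpha, with only the multiplicative factor \beta_p
% (and the irreducible loss) depending on the mixture weight f_p.
\begin{align}
  L^{(p)}(N) &\approx \beta_p \,\bigl(N^{(p)}_{\mathrm{eff}}\bigr)^{-\alpha} + L^{(p)}_{\infty}, \\
  N^{(p)}_{\mathrm{eff}} &= g_p(f_p)\, N,
  \qquad g_p(f_p) \in (0, 1],
\end{align}
% where g_p(f_p) is the (assumed) fraction of the model's capacity effectively
% allocated to language pair p when it is trained with mixture weight f_p.
Under such a form, changing the mixture weights moves $N^{(p)}_{\mathrm{eff}}$ (and hence acts as a multiplicative factor on the loss curve) while leaving the exponent $\alpha$ unchanged, which is the behavior the abstract reports.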