In this work, we study how the generalization performance of a given direction changes with its sampling ratio in Multilingual Neural Machine Translation (MNMT). By training over 200 multilingual models with various model sizes, directions, and total numbers of tasks, we find that scalarization leads to a multitask trade-off front that deviates from the traditional Pareto front when there is data imbalance in the training corpus. That is, the performance of a given translation direction does not always improve as its weight in the multi-task optimization objective increases, which poses a great challenge to improving the overall performance of all directions. Based on our observations, we propose the Double Power Law to predict this unique performance trade-off front in MNMT, which is robust across various languages, data adequacy levels, and numbers of tasks. Finally, we formulate the sample ratio selection problem in MNMT as an optimization problem based on the Double Power Law, which, in our experiments, achieves better performance than temperature searching and gradient manipulation methods while using at most half of the total training budget.
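To make the recipe the abstract outlines concrete, below is a minimal Python sketch: fit a per-direction performance curve to a few (sampling weight, validation loss) observations, then choose sampling ratios on the probability simplex that minimize the average predicted loss. The two-term power curve, the synthetic data, and the direction names are all hypothetical placeholders chosen to reproduce the non-monotone behavior described above; the paper's actual Double Power Law parameterization and fitting procedure are not reproduced here.

```python
import numpy as np
from scipy.optimize import curve_fit, minimize

# Hypothetical per-direction curve: validation loss as a function of the
# direction's sampling weight w.  The two power terms let the loss rise
# again at large w, mimicking the non-monotone behavior the abstract
# describes; the paper's actual Double Power Law may differ in form.
def loss_curve(w, k1, a, k2, b, c):
    return c + k1 * np.power(w, -a) + k2 * np.power(w, b)

rng = np.random.default_rng(0)
w_grid = np.linspace(0.05, 0.95, 9)

# Synthetic observations for two directions (stand-ins for pilot runs):
# a high-resource and a low-resource direction with different curvature.
true_params = {
    "en-de": (0.30, 0.7, 0.10, 2.0, 2.0),   # k1, a, k2, b, c
    "en-ne": (0.15, 0.5, 0.60, 3.0, 2.4),
}
obs = {d: loss_curve(w_grid, *p) + rng.normal(0, 0.01, w_grid.size)
       for d, p in true_params.items()}

# Fit one curve per direction from the noisy observations.
fitted = {}
for d, y in obs.items():
    popt, _ = curve_fit(loss_curve, w_grid, y,
                        p0=(0.2, 0.5, 0.2, 2.0, 2.0), maxfev=20000)
    fitted[d] = popt

# Pick sampling ratios on the probability simplex that minimize the
# average predicted loss over all directions.
dirs = list(fitted)

def avg_loss(w):
    return float(np.mean([loss_curve(wi, *fitted[d])
                          for wi, d in zip(w, dirs)]))

n = len(dirs)
res = minimize(avg_loss, x0=np.full(n, 1.0 / n),
               bounds=[(0.01, 0.99)] * n,
               constraints=[{"type": "eq",
                             "fun": lambda w: w.sum() - 1.0}])
print({d: round(float(wi), 3) for d, wi in zip(dirs, res.x)})
```

Generating the toy data from known parameters keeps the sketch self-checking: the fitted curves should recover the planted shapes, and the constrained `minimize` call (SLSQP under the equality constraint) then plays the role of the sample ratio selection step.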