The multilingual Transformer improves parameter efficiency and cross-lingual transfer, yet how to effectively train multilingual models has not been well studied. Using multilingual machine translation as a testbed, we study optimization challenges from the perspectives of loss landscape and parameter plasticity. We find that imbalanced training data induces task interference between high- and low-resource languages, characterized by nearly orthogonal gradients for major parameters and an optimization trajectory mostly dominated by high-resource languages. We show that the local curvature of the loss surface affects the degree of interference, and that existing data-subsampling heuristics implicitly reduce sharpness, although they still face a trade-off between high- and low-resource languages. We propose a principled multi-objective optimization algorithm, Curvature Aware Task Scaling (CATS), which improves both optimization and generalization, especially for low-resource languages. Experiments on the TED, WMT, and OPUS-100 benchmarks demonstrate that CATS advances the Pareto front of accuracy while remaining efficient to apply in massive multilingual settings at the scale of 100 languages.
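As a minimal, hypothetical illustration of the gradient-orthogonality diagnostic mentioned above (a sketch assuming PyTorch, not the paper's implementation), the snippet below computes the cosine similarity between gradients obtained from a high-resource batch and a low-resource batch; values near zero correspond to the nearly orthogonal, interfering updates the abstract describes. Here `loss_high` and `loss_low` stand for the translation losses on a high-resource and a low-resource language batch, respectively.

```python
import torch
import torch.nn.functional as F

def gradient_cosine(model: torch.nn.Module,
                    loss_high: torch.Tensor,
                    loss_low: torch.Tensor) -> float:
    """Cosine similarity between per-task gradients of the shared parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_high = torch.autograd.grad(loss_high, params, retain_graph=True, allow_unused=True)
    g_low = torch.autograd.grad(loss_low, params, allow_unused=True)

    def flatten(grads):
        # Treat parameters unused by a task as having zero gradient.
        return torch.cat([
            (g if g is not None else torch.zeros_like(p)).reshape(-1)
            for g, p in zip(grads, params)
        ])

    return F.cosine_similarity(flatten(g_high), flatten(g_low), dim=0).item()
```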