Multilingual models are parameter-efficient and especially effective at improving low-resource languages by leveraging crosslingual transfer. Despite recent advances in massive multilingual translation with ever-growing models and data, how to effectively train multilingual models is not well understood. In this paper, we show that a common situation in multilingual training, data imbalance among languages, creates an optimization tension between high-resource and low-resource languages, where the multilingual solution found is often sub-optimal for low-resource languages. We show that the common training practice of upsampling low-resource languages cannot robustly optimize the population loss, risking either underfitting high-resource languages or overfitting low-resource ones. Drawing on recent findings on the geometry of the loss landscape and its effect on generalization, we propose a principled optimization algorithm, Curvature Aware Task Scaling (CATS), which adaptively rescales gradients from different tasks with a meta objective of guiding multilingual training toward low-curvature neighborhoods with uniformly low loss for all languages. We ran experiments on common benchmarks (TED, WMT, and OPUS-100) with varying degrees of data imbalance. CATS effectively improved multilingual optimization and, as a result, demonstrated consistent gains on low-resource languages ($+0.8$ to $+2.2$ BLEU) without hurting high-resource ones. In addition, CATS is robust to overparameterization and large-batch training, making it a promising method for training massive multilingual models that truly improve low-resource languages.
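The abstract does not specify the CATS update rule, so the sketch below is only one illustrative reading of the idea in PyTorch: per-task curvature is approximated by the squared gradient norm (a cheap proxy, assumed here), and task weights up-scale high-loss, low-curvature tasks so the shared update is steered toward flat regions with uniformly low loss across languages. The function name `cats_style_step`, the curvature proxy, and the weighting rule are all hypothetical, not the paper's definition.

```python
# A minimal sketch of curvature-aware per-task gradient rescaling.
# Assumptions (not from the paper): curvature is proxied by the squared
# gradient norm of each task, and task weights favor tasks with high loss
# but low curvature, nudging the shared update toward flat minima.

import torch

def cats_style_step(model, task_batches, loss_fn, optimizer, eps=1e-8):
    """One combined optimizer step from per-task gradients with adaptive rescaling."""
    params = [p for p in model.parameters() if p.requires_grad]
    task_grads, curvatures, losses = [], [], []

    for x, y in task_batches:  # one batch per language/task
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params, allow_unused=True)
        grads = [g if g is not None else torch.zeros_like(p)
                 for g, p in zip(grads, params)]
        flat = torch.cat([g.reshape(-1) for g in grads])
        task_grads.append(grads)
        curvatures.append(flat.pow(2).sum())  # cheap curvature proxy (assumption)
        losses.append(loss.detach())

    # Assumed weighting rule: emphasize high-loss, low-curvature tasks,
    # then normalize so the weights sum to one.
    weights = torch.stack(losses) / (torch.stack(curvatures).sqrt() + eps)
    weights = weights / weights.sum()

    # Accumulate the rescaled per-task gradients into a single update.
    optimizer.zero_grad()
    for w, grads in zip(weights, task_grads):
        for p, g in zip(params, grads):
            p.grad = g * w if p.grad is None else p.grad + g * w
    optimizer.step()
    return losses, weights
```

In this reading, the rescaling plays the role that static upsampling plays in the common recipe, but the weights adapt each step to both loss level and local curvature rather than being fixed by corpus size.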