Robust Optimization for Multilingual Translation with Imbalanced Data

Multilingual models are parameter-efficient and hold the prospect of improving low-resource languages by leveraging crosslingual transfer. Despite recent advances in massive multilingual translation with ever-growing models and data, how to train multilingual models effectively is still not well understood. In this paper, we show that a common situation in multilingual training, data imbalance among languages, creates optimization tension between high-resource and low-resource languages, so that the multilingual solution found is often sub-optimal for the low-resource ones. We show that the common training method of upsampling low-resource languages cannot robustly optimize the population loss: it risks either underfitting high-resource languages or overfitting low-resource ones. Drawing on recent findings on the geometry of the loss landscape and its effect on generalization, we propose a principled optimization algorithm, Curvature Aware Task Scaling (CATS), which adaptively rescales gradients from different tasks with a meta-objective of guiding multilingual training toward low-curvature neighborhoods with uniformly low loss for all languages. We ran experiments on common benchmarks (TED, WMT and OPUS-100) with varying degrees of data imbalance. CATS effectively improved multilingual optimization and, as a result, demonstrated consistent gains on low-resource languages ( to BLEU) without hurting high-resource ones. In addition, CATS is robust to overparameterization and large-batch training, making it a promising training method for massive multilingual models that truly improve low-resource languages.
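To make the upsampling baseline concrete, here is a minimal sketch of temperature-based sampling, one widely used way to upsample low-resource languages. The abstract does not say which upsampling scheme is meant, so the formula p_i ∝ (n_i / N)^(1/T), the function name `sampling_probs`, and the corpus sizes below are all assumptions made for illustration.

```python
def sampling_probs(sizes, temperature=1.0):
    """Per-language sampling probabilities p_i proportional to (n_i / N) ** (1 / T).

    temperature = 1.0 samples in proportion to corpus size;
    larger temperatures flatten the distribution and upsample small corpora.
    """
    total = sum(sizes.values())
    scaled = {
        lang: (n / total) ** (1.0 / temperature)
        for lang, n in sizes.items()
    }
    norm = sum(scaled.values())
    return {lang: weight / norm for lang, weight in scaled.items()}


# Hypothetical corpus sizes in sentence pairs (illustrative, not from the paper).
sizes = {"fr": 1_000_000, "gl": 10_000}

proportional = sampling_probs(sizes, temperature=1.0)  # follows the data imbalance
upsampled = sampling_probs(sizes, temperature=5.0)     # boosts the small corpus

for lang in sizes:
    print(f"{lang}: proportional={proportional[lang]:.4f} upsampled={upsampled[lang]:.4f}")
```

With a fixed temperature, the low-resource corpus is revisited many more times per epoch than the high-resource one, which illustrates the tension the abstract describes: too high a temperature risks overfitting the small corpus, too low a temperature leaves it underrepresented.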
