Multilingual models jointly pretrained on multiple languages have achieved remarkable performance on various multilingual downstream tasks. Moreover, models finetuned on a single monolingual downstream task have been shown to generalize to unseen languages. In this paper, we first show that aligning the gradients between tasks is crucial for maximizing knowledge transfer while minimizing negative transfer. Despite its importance, existing methods for gradient alignment either serve a completely different purpose, ignore inter-task alignment, or target continual learning problems in rather inefficient ways. As a result of misaligned gradients between tasks, the model suffers from severe negative transfer in the form of catastrophic forgetting of the knowledge acquired during pretraining. To overcome these limitations, we propose a simple yet effective method that efficiently aligns gradients between tasks. Specifically, we perform each inner optimization by sequentially sampling batches from all the tasks, followed by a Reptile outer update. Thanks to the gradients aligned across tasks by our method, the model becomes less vulnerable to negative transfer and catastrophic forgetting. We extensively validate our method on various multi-task learning and zero-shot cross-lingual transfer tasks, where it largely outperforms all the relevant baselines we consider.
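To make the inner/outer-loop structure described above concrete, the following is a minimal PyTorch-style sketch of a single outer step: the inner trajectory sequentially draws one batch from every task, and the parameters are then moved toward the adapted weights with a Reptile-style interpolation. The `model`, `task_loaders`, `compute_loss` helper, and all hyperparameter values are hypothetical illustrations, not the paper's actual implementation.

```python
import copy
import itertools

import torch


def reptile_multitask_step(model, task_loaders, inner_lr=1e-3,
                           outer_lr=0.1, inner_steps=4):
    """One outer update: run a short inner trajectory that sequentially
    visits a batch from every task, then interpolate the original weights
    toward the adapted weights (Reptile-style outer update)."""
    # Adapt a copy so the outer step reduces to a plain parameter interpolation.
    fast_model = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(fast_model.parameters(), lr=inner_lr)

    # cycle() keeps batches flowing for this sketch; a real run would
    # manage epochs and exhausted loaders explicitly.
    iterators = [itertools.cycle(dl) for dl in task_loaders]

    for _ in range(inner_steps):
        # Visit every task inside a single inner trajectory, so the
        # resulting update direction reflects all tasks jointly.
        for it in iterators:
            batch = next(it)
            loss = fast_model.compute_loss(batch)  # hypothetical loss helper
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()

    # Reptile outer update: theta <- theta + outer_lr * (theta_fast - theta).
    with torch.no_grad():
        for p, p_fast in zip(model.parameters(), fast_model.parameters()):
            p.add_(outer_lr * (p_fast - p))
```

Because every inner trajectory passes through all tasks before the outer interpolation, the outer update direction implicitly favors parameter changes on which the tasks agree, which is the gradient-alignment effect the abstract describes.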