Standard fine-tuning of language models typically performs well on in-distribution data but struggles to generalize under distribution shifts. In this work, we aim to improve the generalization of adapter-based cross-lingual task transfer, where such cross-language distribution shifts are inherent. We investigate scheduled unfreezing algorithms -- originally proposed to mitigate catastrophic forgetting in transfer learning -- for fine-tuning task adapters in cross-lingual transfer. Our experiments show that scheduled unfreezing methods close the gap to full fine-tuning and achieve state-of-the-art transfer performance, suggesting that these methods go beyond merely mitigating catastrophic forgetting. To better understand these empirical findings, we then study the learning dynamics of scheduled unfreezing through the lens of Fisher Information. Our in-depth experiments reveal that scheduled unfreezing induces learning dynamics different from those of standard fine-tuning and provide evidence that the dynamics of Fisher Information during training correlate with cross-lingual generalization performance. We further propose a general scheduled unfreezing algorithm that improves over standard fine-tuning by an average of 2 points across four datasets, and we provide strong empirical evidence for a theory-based justification of the heuristic unfreezing schedule (i.e., the heuristic schedule implicitly maximizes Fisher Information). Our code will be publicly available.
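To make the heuristic schedule concrete, below is a minimal sketch of scheduled (gradual) unfreezing applied to adapter fine-tuning, assuming a PyTorch model whose adapter parameters can be grouped per transformer layer. The helper `get_adapter_layers` and the `unfreeze_interval` parameter are illustrative assumptions, not the paper's released implementation; the paper's general algorithm additionally ties the schedule to Fisher Information dynamics, which this sketch does not reproduce.

```python
# Sketch of heuristic scheduled unfreezing for task adapters (assumption-based,
# not the authors' code): adapters start frozen and are unfrozen one layer at a
# time, from the top transformer layer downward, every `unfreeze_interval` steps.
import torch

def get_adapter_layers(model):
    """Hypothetical helper: return adapter parameter groups, one list per
    transformer layer, ordered from the top (last) layer to the bottom (first)."""
    raise NotImplementedError

def train_with_scheduled_unfreezing(model, loader, num_epochs, unfreeze_interval=100):
    layers = get_adapter_layers(model)  # top-to-bottom order
    # Start with every adapter frozen; only already-trainable parameters
    # (e.g., the task head) are optimized at step 0.
    for group in layers:
        for p in group:
            p.requires_grad = False

    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=1e-4)

    step, next_layer = 0, 0
    for _ in range(num_epochs):
        for batch in loader:
            # Unfreeze one more adapter layer every `unfreeze_interval` steps.
            if step % unfreeze_interval == 0 and next_layer < len(layers):
                for p in layers[next_layer]:
                    p.requires_grad = True
                optimizer.add_param_group({"params": layers[next_layer]})
                next_layer += 1

            loss = model(**batch).loss  # assumes a HF-style model returning .loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            step += 1
```

In this sketch the schedule is purely step-based; tracking the empirical Fisher Information of the unfrozen parameters during training (as the abstract describes) would require an additional diagnostic pass and is omitted here.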