Fine-tuning the entire set of parameters of a large pretrained model has become the mainstream approach for transfer learning. To increase its efficiency and prevent catastrophic forgetting and interference, techniques like adapters and sparse fine-tuning have been developed. Adapters are modular, as they can be combined to adapt a model towards different facets of knowledge (e.g., dedicated language and/or task adapters). Sparse fine-tuning is expressive, as it controls the behavior of all model components. In this work, we introduce a new fine-tuning method with both these desirable properties. In particular, we learn sparse, real-valued masks based on a simple variant of the Lottery Ticket Hypothesis. Task-specific masks are obtained from annotated data in a source language, and language-specific masks from masked language modeling in a target language. Both these masks can then be composed with the pretrained model. Unlike adapter-based fine-tuning, this method neither increases the number of parameters at inference time nor alters the original model architecture. Most importantly, it outperforms adapters in zero-shot cross-lingual transfer by a large margin in a series of multilingual benchmarks, including Universal Dependencies, MasakhaNER, and AmericasNLI. Based on an in-depth analysis, we additionally find that sparsity is crucial to prevent both 1) interference between the fine-tunings to be composed and 2) overfitting. We release the code and models at https://github.com/cambridgeltl/composable-sft.
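To make the composition step concrete, below is a minimal sketch (not the authors' released implementation) of how sparse fine-tunings can be combined with a pretrained model. It assumes the language-specific and task-specific fine-tunings are stored as sparse parameter-difference tensors that are added onto the pretrained weights, so inference uses the original architecture with no extra parameters. The model name, file paths, and the helper `apply_sparse_diffs` are illustrative assumptions, not part of the paper.

```python
# Sketch: additively compose sparse fine-tuning differences with a pretrained
# model, i.e. theta = theta_pretrained + phi_language + phi_task.
import torch
from transformers import AutoModelForTokenClassification

# Pretrained multilingual encoder with a token-classification head
# (e.g. for NER); the checkpoint name is illustrative.
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=9
)

def apply_sparse_diffs(model, *diffs):
    """Add sparse parameter differences onto the pretrained weights in place.

    Each `diff` maps a parameter name to a tensor that is zero everywhere
    except at the few positions selected during sparse fine-tuning, so the
    composition never changes the architecture or parameter count.
    """
    params = dict(model.named_parameters())
    with torch.no_grad():
        for diff in diffs:
            for name, delta in diff.items():
                params[name].add_(delta.to(params[name].dtype))
    return model

# Hypothetical files holding the sparse differences learned, e.g., via masked
# language modelling in the target language and NER in the source language.
language_diff = torch.load("language_sft.pt")  # illustrative path
task_diff = torch.load("task_sft.pt")          # illustrative path

model = apply_sparse_diffs(model, language_diff, task_diff)
model.eval()  # ready for zero-shot inference in the target language
```

Because the differences are sparse and applied in place, swapping in a different language or task only requires loading a different small difference file; this is the modularity the abstract attributes to the method, realized here under the stated additive-composition assumption.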