Prior work has shown that it is possible to expand pretrained Masked Language Models (MLMs) to new languages by learning a new set of embeddings, while keeping the transformer body frozen. Despite learning a small subset of parameters, this approach is not compute-efficient, as training the new embeddings requires a full forward and backward pass over the entire model. In this work, we propose mini-model adaptation, a compute-efficient alternative that builds a shallow mini-model from a fraction of a large model's parameters. New language-specific embeddings can then be efficiently trained over the mini-model, and plugged into the aligned large model for rapid cross-lingual transfer. We explore two approaches to learn mini-models: MiniJoint, which jointly pretrains the primary model and the mini-model using a single transformer with a secondary MLM head at a middle layer; and MiniPost, where we start from a regular pretrained model and build a mini-model by extracting and freezing a few layers and learning a small number of parameters on top. Experiments on XNLI, MLQA and PAWS-X show that mini-model adaptation matches the performance of the standard approach using up to 2.4x less compute.
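To make the MiniJoint idea concrete, below is a minimal PyTorch sketch of a transformer encoder with a secondary MLM head attached at a middle layer, plus the embedding-only adaptation step. All layer counts, dimensions, the tied-embedding MLM head, and the function names are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of the MiniJoint setup: one transformer, two MLM heads
# (one at a middle layer for the mini-model, one at the top for the full model).
import torch
import torch.nn as nn

class MiniJointEncoder(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=12, mini_layers=4, n_heads=8):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True) for _ in range(n_layers)]
        )
        self.mini_layers = mini_layers  # depth of the shallow mini-model

    def mlm_logits(self, hidden):
        # MLM output projection tied to the input embeddings (an assumption made
        # here so that swapping the embeddings also swaps the output vocabulary).
        return hidden @ self.embeddings.weight.T

    def forward(self, input_ids, mini_only=False):
        h = self.embeddings(input_ids)
        mini_logits = None
        for i, layer in enumerate(self.layers):
            h = layer(h)
            if i + 1 == self.mini_layers:
                mini_logits = self.mlm_logits(h)  # secondary MLM head at a middle layer
                if mini_only:                     # adaptation: skip the upper layers entirely
                    return mini_logits, None
        return mini_logits, self.mlm_logits(h)    # joint pretraining trains both heads

def adapt_to_new_language(model: MiniJointEncoder, new_vocab_size: int) -> MiniJointEncoder:
    # Replace the embeddings with fresh language-specific ones and freeze everything else.
    d_model = model.embeddings.embedding_dim
    model.embeddings = nn.Embedding(new_vocab_size, d_model)
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("embeddings")  # transformer body stays frozen
    return model
```

In this sketch, adaptation would run the forward and backward pass with `mini_only=True`, so gradients only flow through the first few layers; for cross-lingual transfer, the newly trained embeddings are used with the full stack (`mini_only=False`), matching the plug-in step described in the abstract.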