For many (minority) languages, the resources needed to train large models are not available. We investigate the performance of zero-shot transfer learning with as little data as possible, and the influence of language similarity in this process. We retrain the lexical layers of four BERT-based models using data from two low-resource target language varieties, while the Transformer layers are independently fine-tuned on a POS-tagging task in the model's source language. By combining the new lexical layers and fine-tuned Transformer layers, we achieve high task performance for both target languages. With high language similarity, 10MB of data appears sufficient to achieve substantial monolingual transfer performance. Monolingual BERT-based models generally achieve higher downstream task performance after retraining the lexical layer than multilingual BERT, even when the target language is included in the multilingual model.
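To illustrate the combination step described above, the following is a minimal sketch of how a retrained lexical (word-embedding) layer can be swapped into a Transformer fine-tuned on source-language POS tagging, using the Hugging Face transformers API. This is an assumed setup, not the authors' released code; the model paths are placeholders, and it presumes the retrained lexical model and the fine-tuned tagger share the same BERT architecture.

```python
# Sketch (assumed setup): combine a lexical layer retrained on target-language
# text with Transformer layers fine-tuned on source-language POS tagging.
# All paths below are hypothetical placeholders.
from transformers import AutoModel, AutoModelForTokenClassification, AutoTokenizer

# Transformer layers (plus classification head) fine-tuned for POS tagging
# in the model's source language.
pos_tagger = AutoModelForTokenClassification.from_pretrained("path/to/source-pos-tagger")

# Same architecture whose lexical layer was retrained on ~10MB of
# target-language text, with the other parameters kept frozen.
lexical_model = AutoModel.from_pretrained("path/to/target-lexical-model")
target_tokenizer = AutoTokenizer.from_pretrained("path/to/target-lexical-model")

# Swap in the target-language embeddings; the fine-tuned Transformer layers
# and the POS classification head are left untouched.
pos_tagger.set_input_embeddings(lexical_model.get_input_embeddings())
pos_tagger.config.vocab_size = lexical_model.config.vocab_size

# The combined model is used zero-shot on the target language, tokenized
# with the target-language tokenizer.
pos_tagger.save_pretrained("path/to/combined-model")
target_tokenizer.save_pretrained("path/to/combined-model")
```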