We present NN-Rank, an algorithm for ranking source languages for cross-lingual transfer that leverages hidden representations from multilingual models together with unlabeled target-language data. We experiment with two pretrained multilingual models and two tasks: part-of-speech tagging (POS) and named entity recognition (NER). We consider 51 source languages and evaluate on 56 and 72 target languages for POS and NER, respectively. When in-domain data is available, NN-Rank outperforms state-of-the-art baselines built on lexical and linguistic features, with average improvements of up to 35.56 NDCG for POS and 18.14 NDCG for NER. Since prior approaches can fall back to language-level features when target-language data is unavailable, we also show that NN-Rank remains competitive using only the Bible, an out-of-domain corpus available for a large number of languages. Ablations on the amount of unlabeled target data show that, with subsets of as few as 25 examples, NN-Rank produces high-quality rankings that retain 92.8% of the NDCG obtained when ranking with all available target data.
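The abstract does not spell out NN-Rank's scoring rule, so the following is only a minimal sketch of one plausible nearest-neighbor ranking consistent with the description (multilingual encoder representations plus unlabeled target examples). The function name nn_rank, the parameter k, and the mean-over-top-k cosine score are our illustrative assumptions, not the paper's exact procedure.

import numpy as np

def nn_rank(target_embs, source_embs_by_lang, k=1):
    """Rank candidate source languages for one target language.

    target_embs: (n_t, d) array of hidden representations of unlabeled
        target-language examples from a multilingual encoder.
    source_embs_by_lang: dict mapping a source-language code to an
        (n_s, d) array of representations of that language's examples.
    k: number of nearest source neighbors per target example
        (hypothetical knob, not from the paper).

    Returns (language, score) pairs sorted best-first.
    """
    # L2-normalize so dot products are cosine similarities.
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    scores = {}
    for lang, s in source_embs_by_lang.items():
        s = s / np.linalg.norm(s, axis=1, keepdims=True)
        sims = t @ s.T                        # (n_t, n_s) cosine similarities
        topk = np.sort(sims, axis=1)[:, -k:]  # k nearest source neighbors per target example
        scores[lang] = topk.mean()            # average over neighbors and target examples
    return sorted(scores.items(), key=lambda kv: -kv[1])

Under this reading, a source language scores highly when its examples sit close to the target examples in the encoder's representation space, which would also explain why the method degrades gracefully with as few as 25 target examples.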
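For reference, the NDCG figures above refer to the standard normalized discounted cumulative gain (here presumably reported on a 0-100 scale); in the common formulation, with $rel_i$ the graded relevance of the source language ranked at position $i$:

$$\mathrm{DCG}@p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{NDCG}@p = \frac{\mathrm{DCG}@p}{\mathrm{IDCG}@p},$$

where $\mathrm{IDCG}@p$ is the DCG of the ideal ordering, so a perfect ranking attains the maximum score.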