Multilingual pretrained representations generally rely on subword segmentation algorithms to create a shared multilingual vocabulary. However, standard heuristic algorithms often lead to sub-optimal segmentation, especially for languages with limited amounts of data. In this paper, we take two major steps towards alleviating this problem. First, we demonstrate empirically that applying existing subword regularization methods (Kudo, 2018; Provilkov et al., 2020) during fine-tuning of pretrained multilingual representations improves the effectiveness of cross-lingual transfer. Second, to take full advantage of different possible input segmentations, we propose Multi-view Subword Regularization (MVR), a method that enforces consistency between predictions on inputs tokenized by the standard and by probabilistically sampled segmentations. Results on the XTREME multilingual benchmark (Hu et al., 2020) show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
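To make the two ingredients concrete, the following is a minimal sketch, not the authors' released implementation. It assumes a trained SentencePiece model at a hypothetical path `spm.model`, a classifier `model` mapping token-id batches to logits, and an illustrative weight `lam`; the symmetric-KL form of the consistency term and the equal weighting of the two supervised losses are modeling assumptions for exposition.

```python
# Sketch of (1) probabilistic subword sampling and (2) an MVR-style
# consistency objective. Assumptions: "spm.model" is a hypothetical
# SentencePiece model path; `model`, `lam` are illustrative.
import torch
import torch.nn.functional as F
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")

text = "Multilingual pretrained representations"

# Standard (deterministic) segmentation: the single best tokenization.
det_ids = sp.encode(text)

# Subword regularization (Kudo, 2018): sample a tokenization from the
# model's n-best candidates with smoothing temperature alpha.
sampled_ids = sp.encode(text, enable_sampling=True, nbest_size=-1, alpha=0.1)


def mvr_loss(model, det_batch, sampled_batch, labels, lam=0.5):
    """Multi-view objective (sketch): supervised loss on both views plus
    a symmetric KL term tying the two predictive distributions together."""
    logits_det = model(det_batch)      # view 1: deterministic segmentation
    logits_smp = model(sampled_batch)  # view 2: sampled segmentation

    ce = 0.5 * (F.cross_entropy(logits_det, labels)
                + F.cross_entropy(logits_smp, labels))

    p = F.log_softmax(logits_det, dim=-1)
    q = F.log_softmax(logits_smp, dim=-1)
    consistency = 0.5 * (
        F.kl_div(p, q, log_target=True, reduction="batchmean")
        + F.kl_div(q, p, log_target=True, reduction="batchmean"))

    return ce + lam * consistency
```

In this sketch the consistency term encourages the model to make the same prediction regardless of how the input was segmented, which is the intuition behind exploiting multiple views of one sentence.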