Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP. However, their large-scale deployment to many languages is hindered not only by pretraining data scarcity, but also by growing vocabulary sizes and the limits of their parameter budget. In order to boost the capacity of mPLMs to deal with low-resource and unseen languages, we explore the potential of leveraging transliteration on a massive scale. In particular, we focus on the UROMAN transliteration tool, which maps UTF-8 characters of all writing systems to the Latin script, enabling inexpensive romanization of virtually any language. We first establish how UROMAN compares against language-specific, manually curated transliterators for adapting mPLMs. We then study and compare a range of data- and parameter-efficient strategies for adapting mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages. Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with the largest gains in the most challenging setups: languages with unseen scripts and with limited training data, without any vocabulary augmentation. Further analyses reveal that an improved tokenizer based on romanized data can even outperform non-transliteration-based methods in the majority of languages.
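Since the approach hinges on cheap, universal romanization as a preprocessing step, the sketch below illustrates how a corpus might be piped through the UROMAN Perl tool (github.com/isi-nlp/uroman) before tokenization and adaptation. The UROMAN_PL path and the romanize helper are illustrative assumptions, not part of the paper's released code; uroman.pl itself does read UTF-8 text on stdin and emit Latin-script text on stdout, with an optional -l language-code flag.

```python
import subprocess

# Hypothetical local path to a UROMAN checkout; adjust to your system.
UROMAN_PL = "/path/to/uroman/bin/uroman.pl"

def romanize(lines, lang_code=None):
    """Romanize an iterable of text lines by piping them through uroman.pl.

    UROMAN maps UTF-8 characters of any writing system to the Latin
    script, one output line per input line. An optional ISO 639-3 code
    (-l) can activate language-specific romanization rules.
    """
    cmd = ["perl", UROMAN_PL]
    if lang_code:
        cmd += ["-l", lang_code]
    proc = subprocess.run(
        cmd,
        input="\n".join(lines),
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()

# Example: romanize Amharic text before feeding it to an mPLM tokenizer
# trained on Latin-script data. Output shown is illustrative.
print(romanize(["ሰላም ለዓለም"], lang_code="amh"))
```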