Because large representative corpora are scarce for most languages, Multilingual Language Models (MLLMs) must extract as much as possible from existing corpora. In this regard, script diversity presents a challenge to MLLMs by reducing lexical overlap among closely related languages. Therefore, transliterating closely related languages that use different scripts to a common script may improve the downstream task performance of MLLMs. In this paper, we pretrain two ALBERT models to empirically measure the effect of transliteration on MLLMs. We specifically focus on the Indo-Aryan language family, which has the highest script diversity in the world. Afterward, we evaluate our models on the IndicGLUE benchmark. We perform the Mann-Whitney U test to rigorously verify whether the effect of transliteration is significant. We find that transliteration benefits low-resource languages without negatively affecting the comparatively high-resource languages. We also measure the cross-lingual representation similarity (CLRS) of the models using centered kernel alignment (CKA) on parallel sentences of eight languages from the FLORES-101 dataset. We find that the hidden representations of the transliteration-based model have higher and more stable CLRS scores. Our code is available on GitHub (github.com/ibraheem-moosa/XLM-Indic) and our models on the Hugging Face Hub (huggingface.co/ibraheemmoosa/xlmindic-base-multiscript and huggingface.co/ibraheemmoosa/xlmindic-base-uniscript).
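To make the CLRS measurement concrete, the following is a minimal sketch of linear CKA between two sets of sentence representations. It rests on assumptions the abstract does not confirm: it uses the linear variant of CKA (a kernel variant may have been used instead), it takes one pooled vector per sentence, and it substitutes random toy matrices for real model hidden states. The names `linear_cka`, `reps_lang_a`, `reps_lang_b`, and `reps_unrelated` are hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two matrices of shape (n_sentences, hidden_dim).

    Row i of x and row i of y are assumed to represent the same parallel
    sentence in two different languages.
    """
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must cover the same number of sentences")
    # Center every feature column so constant offsets do not affect the score.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross term, normalized by the self terms.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Toy stand-ins for pooled hidden states (assumption: 64 sentences, 16 dims).
rng = np.random.default_rng(0)
reps_lang_a = rng.normal(size=(64, 16))

# An orthogonal rotation of the same representations: identical geometry.
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
reps_lang_b = reps_lang_a @ rotation

# Independent random representations: no shared structure.
reps_unrelated = rng.normal(size=(64, 16))

score_same = linear_cka(reps_lang_a, reps_lang_b)
score_unrelated = linear_cka(reps_lang_a, reps_unrelated)
print(f"rotated copy: {score_same:.3f}, unrelated: {score_unrelated:.3f}")
```

Linear CKA is invariant to orthogonal transformations, so the rotated copy scores 1 up to floating-point error, while the unrelated matrix scores noticeably lower. In an actual evaluation, the toy matrices would be replaced by hidden states from a given layer for parallel sentences in two languages, and the score would be computed per layer and per language pair.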