India is a multilingual society, with 1369 rationalized languages and dialects spoken across the country (INDIA, 2011). Of these, the 22 scheduled languages have a staggering total of 1.17 billion speakers, and 121 languages have more than 10,000 speakers each (INDIA, 2011). India also has the second-largest (and ever-growing) digital footprint (Statista, 2020). Despite this, today's state-of-the-art multilingual systems perform suboptimally on Indian (IN) languages. This can be explained by the fact that multilingual language models (LMs) are often trained on 100+ languages together, leaving IN languages under-represented in their vocabulary and training data. Multilingual LMs are substantially less effective in resource-lean scenarios (Wu and Dredze, 2020; Lauscher et al., 2020), since limited data cannot capture the many nuances of a language. IN language text is also commonly transliterated to the Latin script or code-mixed with English, especially in informal settings such as social media platforms (Rijhwani et al., 2017). This phenomenon is not adequately handled by current state-of-the-art multilingual LMs.

To address the aforementioned gaps, we propose MuRIL, a multilingual LM built specifically for IN languages. MuRIL is trained solely on large amounts of IN text corpora. We explicitly augment monolingual text corpora with both translated and transliterated document pairs, which serve as supervised cross-lingual signals during training. MuRIL significantly outperforms multilingual BERT (mBERT) on all tasks in the challenging cross-lingual XTREME benchmark (Hu et al., 2020). We also present results on transliterated (native to Latin script) test sets of the chosen datasets and demonstrate the efficacy of MuRIL in handling transliterated data.
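To make the role of the transliterated pairs concrete, the sketch below shows one way a native-script document could be paired with its Latin-script transliteration and packed into a single training example. The indic_transliteration package, the ITRANS romanization scheme, and the "[SEP]"-joined example format are our illustrative assumptions; the paper does not prescribe this particular pipeline.

```python
# A minimal sketch of generating transliterated document pairs as a
# cross-lingual supervision signal. The library, scheme, and example
# format below are illustrative assumptions, not the authors' pipeline.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

def make_transliterated_pair(doc: str) -> tuple[str, str]:
    """Pair a Devanagari document with its Latin (ITRANS) transliteration."""
    latin = transliterate(doc, sanscript.DEVANAGARI, sanscript.ITRANS)
    return doc, latin

native, latin = make_transliterated_pair("मुझे यह किताब पसंद है")

# Both sides of the pair can then be concatenated into one training
# example, so the model sees the same content in both scripts and can
# learn a shared representation across them (a TLM-style objective).
example = native + " [SEP] " + latin
print(example)
```

A translation-pair signal could be constructed analogously, replacing the transliterator with a machine-translated (or parallel-corpus) counterpart of the same document.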