Recent research on multilingual language models (LMs) has demonstrated their ability to handle multiple languages effectively in a single model. This holds promise for low web-resource languages (LRLs), since multilingual models can enable transfer of supervision from high-resource languages to LRLs. However, incorporating a new language into an LM remains a challenge, particularly for languages with limited corpora and unseen scripts. In this paper, we argue that relatedness among languages in a language family can be exploited to overcome some of the corpus limitations of LRLs, and we propose RelateLM. We focus on Indian languages and exploit relatedness along two dimensions: (1) script (since many Indic scripts originated from the Brahmic script), and (2) sentence structure. RelateLM uses transliteration to convert the unseen script of limited LRL text into the script of a Related Prominent Language (RPL), Hindi in our case. Exploiting similar sentence structures, RelateLM uses readily available bilingual dictionaries to pseudo-translate RPL text into LRL corpora. Experiments on multiple real-world benchmark datasets validate our hypothesis that using a related language as a pivot, along with data augmentation based on transliteration and pseudo-translation, can be an effective way to adapt LMs for LRLs, rather than direct training or pivoting through English.
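The two augmentation steps named in the abstract can be illustrated with a minimal sketch. It is not RelateLM's implementation. It assumes that transliteration can be approximated by shifting code points between the parallel Unicode blocks of Brahmi-derived scripts, and that pseudo-translation is a word-by-word dictionary lookup that keeps the source word order. The names `BLOCK_START`, `transliterate`, and `pseudo_translate`, and the toy dictionary entries, are hypothetical.

```python
# Start code points of some Unicode Indic script blocks (each 128 code points wide).
# These blocks share a largely parallel layout, so a fixed shift is a rough
# approximation of transliteration (assumption; real systems handle exceptions).
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
}
BLOCK_SIZE = 0x80


def transliterate(text, src, tgt="devanagari"):
    """Shift characters of the `src` script block into the `tgt` script block."""
    src_start = BLOCK_START[src]
    shift = BLOCK_START[tgt] - src_start
    out = []
    for ch in text:
        cp = ord(ch)
        if src_start <= cp < src_start + BLOCK_SIZE:
            out.append(chr(cp + shift))
        else:
            out.append(ch)  # leave spaces, digits, punctuation, etc. unchanged
    return "".join(out)


def pseudo_translate(sentence, dictionary):
    """Replace each whitespace-separated token via the dictionary, keeping word order.

    Tokens missing from the dictionary are copied through unchanged.
    """
    return " ".join(dictionary.get(tok, tok) for tok in sentence.split())


if __name__ == "__main__":
    # Bengali letter KA (U+0995) maps onto Devanagari letter KA (U+0915).
    print(transliterate("\u0995", "bengali") == "\u0915")
    # Toy placeholder dictionary, not real RPL-to-LRL entries.
    toy_dict = {"rpl_word1": "lrl_word1", "rpl_word2": "lrl_word2"}
    print(pseudo_translate("rpl_word1 unknown rpl_word2", toy_dict))
```

Keeping the word order fixed in `pseudo_translate` reflects the abstract's premise that related languages share similar sentence structure. A fixed code-point shift will map some characters to unassigned or mismatched positions, so a production transliterator would need per-script exception tables.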