Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages

We explore the impact of leveraging the relatedness of languages that belong to the same family in NLP models using multilingual fine-tuning. We hypothesize and validate that multilingual fine-tuning of pre-trained language models can yield better performance on downstream NLP applications than fine-tuning on individual languages. We present a first-of-its-kind detailed study that tracks the change in performance as languages are added to a base language in a graded and greedy manner (greedy in the sense of choosing the language that gives the best boost in performance). The study reveals that a carefully selected subset of related languages can improve performance significantly more than using all related languages. The Indo-Aryan (IA) language family is chosen for the study, the exact languages being Bengali, Gujarati, Hindi, Marathi, Oriya, Punjabi and Urdu. The script barrier is crossed by simple rule-based transliteration of the text of all languages to Devanagari. Experiments are performed on mBERT, IndicBERT, MuRIL and two RoBERTa-based LMs, the last two pre-trained by us. Low-resource languages, such as Oriya and Punjabi, are found to be the largest beneficiaries of multilingual fine-tuning. The IndicGLUE tasks of Textual Entailment, Entity Classification and Section Title Prediction, together with POS tagging, form our test bed. Compared to monolingual fine-tuning, we obtain a relative performance improvement of up to 150% on the downstream tasks. The surprising takeaway is that for any language there is a particular combination of other languages that yields the best performance, and any additional language is in fact detrimental.
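The graded, greedy addition of languages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `greedy_language_selection`, the `evaluate` callback, and all scores below are hypothetical. In a real experiment, `evaluate` would presumably fine-tune a pre-trained LM on the pooled data of the given languages and return the score on the base language; here it is replaced by a toy lookup with made-up integer scores so that the block runs on its own.

```python
from typing import Callable, FrozenSet, List, Tuple


def greedy_language_selection(
    base: str,
    candidates: List[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Tuple[List[str], float]:
    """Greedily add languages to `base` while the score keeps improving.

    `evaluate` is a hypothetical callback that maps a set of fine-tuning
    languages to a score measured on the base language.
    """
    selected = [base]
    best_score = evaluate(frozenset(selected))
    remaining = [lang for lang in candidates if lang != base]

    while remaining:
        # Score every one-language extension of the current set.
        scored = [(evaluate(frozenset(selected + [lang])), lang) for lang in remaining]
        top_score, top_lang = max(scored)
        if top_score <= best_score:
            break  # no remaining language improves the score
        selected.append(top_lang)
        remaining.remove(top_lang)
        best_score = top_score

    return selected, best_score


# Toy stand-in for fine-tuning plus evaluation. The numbers are invented for
# illustration only and are NOT results from the paper.
TOY_BASE_SCORE = 500
TOY_GAIN = {"hi": 50, "bn": 30, "mr": 10, "pa": 0, "gu": -20, "ur": -40}


def toy_evaluate(languages: FrozenSet[str]) -> float:
    return float(TOY_BASE_SCORE + sum(TOY_GAIN.get(lang, 0) for lang in languages))


selected, score = greedy_language_selection(
    base="or",
    candidates=["bn", "gu", "hi", "mr", "pa", "ur"],
    evaluate=toy_evaluate,
)
print(selected, score)
```

With these toy scores the procedure stops before all candidates are used, which mirrors the abstract's observation that adding languages beyond the best subset does not help. Stopping at the first non-improving step is a design choice of this sketch; the study may track performance across all additions.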
