While Indic NLP has recently made rapid advances in the availability of corpora and pre-trained models, benchmark datasets for standard NLU tasks remain limited. To this end, we introduce IndicXNLI, an NLI dataset for 11 Indic languages, created by high-quality machine translation of the original English XNLI dataset; our analysis attests to the quality of IndicXNLI. By fine-tuning different pre-trained LMs on IndicXNLI, we analyze various cross-lingual transfer techniques with respect to the impact of the choice of language model, language, multi-linguality, mixed-language input, etc. These experiments provide useful insights into the behaviour of pre-trained models for a diverse set of languages.