In this work, we introduce IndicXTREME, a benchmark consisting of nine diverse tasks covering 18 languages of the Indian subcontinent belonging to four different language families. Across languages and tasks, IndicXTREME contains a total of 103 evaluation sets, of which 51 are new contributions to the literature. To maintain high quality, we rely solely on human annotators to curate or translate our datasets. To the best of our knowledge, this is the first effort to create a standard benchmark for Indic languages that aims to test the zero-shot capabilities of pretrained language models. We also release IndicCorp v2, an updated and much larger version of IndicCorp, containing 20.9 billion tokens across 24 languages. We pretrain IndicBERT v2 on IndicCorp v2 and evaluate it on IndicXTREME, showing that it outperforms existing multilingual language models such as XLM-R and MuRIL.
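In XTREME-style benchmarks, zero-shot evaluation typically means fine-tuning the pretrained model once on English task data and then applying the same weights, unchanged, to test sets in the target languages. The following is a minimal sketch of that evaluation step using the HuggingFace transformers API; the checkpoint identifier ai4bharat/IndicBERTv2-MLM-only and the three-way NLI setup are illustrative assumptions, not details confirmed by the abstract.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint identifier for IndicBERT v2; substitute the released ID.
MODEL_ID = "ai4bharat/IndicBERTv2-MLM-only"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Three-way head for an NLI-style task; the head is randomly initialized here
# and would first be fine-tuned on English training data (not shown).
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=3)
model.eval()

# Zero-shot step: the English-fine-tuned weights are applied unchanged to a
# Hindi premise/hypothesis pair from a target-language test set.
inputs = tokenizer(
    "एक आदमी घोड़े पर सवार है।",   # premise: "A man is riding a horse."
    "एक आदमी बाहर है।",           # hypothesis: "A man is outside."
    return_tensors="pt",
    truncation=True,
)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # predicted label index
```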