In this work, we introduce IndicXTREME, a benchmark consisting of nine diverse tasks covering 18 languages of the Indian subcontinent belonging to four different language families. Across languages and tasks, IndicXTREME contains a total of 103 evaluation sets, of which 51 are new contributions to the literature. To maintain high quality, we rely only on human annotators to curate or translate\footnote{For IndicXParaphrase, where an automatic translation system is used, a second human verification and correction step is performed.} our datasets. To the best of our knowledge, this is the first effort toward creating a standard benchmark for Indic languages that aims to test the zero-shot capabilities of pretrained language models. We also release IndicCorp v2, an updated and much larger version of IndicCorp that contains 20.9 billion tokens across 24 languages. We pretrain IndicBERT v2 on IndicCorp v2 and evaluate it on IndicXTREME, showing that it outperforms existing multilingual language models such as XLM-R and MuRIL.