While large language models excel on high-resource multilingual tasks, low- and extremely low-resource Indic languages remain severely under-evaluated. We present IndicParam, a human-curated benchmark of over 13,000 multiple-choice questions covering 11 such languages (Nepali, Gujarati, Marathi, and Odia as low-resource; Dogri, Maithili, Rajasthani, Sanskrit, Bodo, Santali, and Konkani as extremely low-resource), plus a Sanskrit-English code-mixed set. We evaluate 19 LLMs, both proprietary and open-weight, and find that even the top-performing GPT-5 reaches only 45.0% average accuracy, followed by DeepSeek-3.2 (43.1%) and Claude-4.5 (42.7%). We additionally label each question as knowledge-oriented or purely linguistic to distinguish factual recall from grammatical proficiency. Further, we assess the ability of LLMs to handle diverse question formats, such as list-based matching, assertion-reason pairs, and sequence ordering, alongside conventional multiple-choice questions. IndicParam provides insights into the limitations of cross-lingual transfer and establishes a challenging benchmark for Indic languages. The dataset is available at https://huggingface.co/datasets/bharatgenai/IndicParam, and scripts to run the benchmark are available at https://github.com/ayushbits/IndicParam.
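For reference, a minimal sketch of loading the dataset with the Hugging Face `datasets` library; the available configurations, splits, and record fields shown here are assumptions, so consult the dataset card for the actual schema.

```python
# Minimal sketch: load IndicParam from the Hugging Face Hub.
# Assumes the `datasets` library is installed (pip install datasets).
# Config/split names and per-record fields are assumptions; check the
# dataset card at https://huggingface.co/datasets/bharatgenai/IndicParam.
from datasets import load_dataset

ds = load_dataset("bharatgenai/IndicParam")  # loads the default configuration
print(ds)  # prints the available splits and their sizes

first_split = next(iter(ds.values()))
print(first_split[0])  # one multiple-choice question record
```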