Language models based on the Transformer architecture have achieved state-of-the-art performance on a wide range of NLP tasks such as text classification, question answering, and token classification. However, this performance is usually tested and reported on high-resource languages such as English, French, Spanish, and German. Indian languages, on the other hand, are underrepresented in such benchmarks. Although some Indian languages are included in the training of multilingual Transformer models, they have not been the primary focus of such work. In order to evaluate the performance on Indian languages specifically, we analyze these language models through extensive experiments on multiple downstream tasks in Hindi, Bengali, and Telugu. Here, we compare the efficacy of fine-tuning the parameters of pre-trained models against that of training a language model from scratch. Moreover, we empirically argue against a strict dependency between dataset size and model performance, and instead encourage task-specific selection of models and methods. We achieve state-of-the-art performance on the text classification task for Hindi and Bengali. Finally, we present effective strategies for modeling Indian languages and release our model checkpoints for the community: https://huggingface.co/neuralspace-reverie.
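As a minimal sketch of how the released checkpoints could be used for the text classification setting described above, the snippet below loads a model from the Hugging Face hub with the Transformers library. The specific checkpoint name is an assumption (only the organisation URL is given in the abstract); the actual model list is available on the linked page.

```python
# Minimal sketch (not from the paper): loading an assumed released checkpoint
# for sequence classification. Check https://huggingface.co/neuralspace-reverie
# for the actual checkpoint names.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "neuralspace-reverie/indic-transformers-hi-bert"  # assumed Hindi checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Tokenise a Hindi example and run a forward pass; in practice the model
# would first be fine-tuned on a labelled classification dataset.
inputs = tokenizer("यह एक उदाहरण वाक्य है।", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, num_labels)
```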