Although Marathi is the third most popular language in India, it lacks useful NLP resources, and popular NLP libraries do not support it. With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing. We present datasets and transformer models for supervised tasks such as sentiment analysis, named entity recognition, and hate speech detection. We have also published a monolingual Marathi corpus for unsupervised language modeling tasks. Overall, we present the MahaCorpus, MahaSent, MahaNER, and MahaHate datasets, along with the corresponding MahaBERT models fine-tuned on them. We aim to move beyond benchmark datasets and prepare practically useful resources for Marathi. The resources are available at https://github.com/l3cube-pune/MarathiNLP.