We present L3Cube-MahaCorpus, a Marathi monolingual dataset scraped from different internet sources. We expand the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens. We further present MahaBERT, MahaAlBERT, and MahaRoBerta, all BERT-based masked language models, and MahaFT, fastText word embeddings, all trained on the full Marathi corpus of 752M tokens. We show the effectiveness of these resources on downstream classification and NER tasks. Marathi is a widely spoken language in India but still lacks such resources. This work is a step forward in building open resources for the Marathi language. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .
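As a minimal usage sketch, a BERT-based masked language model such as MahaBERT can be loaded with the Hugging Face transformers library. The model identifier below is an assumption based on the repository's naming; consult the GitHub page above for the exact hub IDs.

```python
# Minimal sketch: load a Marathi masked language model and fill a masked token.
# "l3cube-pune/marathi-bert" is an assumed hub ID; see the repository for the
# actual published identifiers.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "l3cube-pune/marathi-bert"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Predict the masked token in a Marathi sentence.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("मुंबई ही महाराष्ट्राची [MASK] आहे."))
```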