The monolingual Hindi BERT models currently available on the model hub do not perform better than the multilingual models on downstream tasks. We present L3Cube-HindBERT, a Hindi BERT model pre-trained on a Hindi monolingual corpus. Further, since the Indic languages Hindi and Marathi share the Devanagari script, we train a single model for both languages. We release DevBERT, a Devanagari BERT model trained on both Marathi and Hindi monolingual datasets. We evaluate these models on downstream Hindi and Marathi text classification and named entity recognition tasks. The HindBERT- and DevBERT-based models show superior performance compared to their multilingual counterparts. These models are shared at https://huggingface.co/l3cube-pune .