In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed `Bangla2B+') by crawling 110 popular Bangla sites. We introduce two new downstream task datasets, on natural language inference and question answering, and benchmark BanglaBERT on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results, outperforming multilingual and monolingual baselines. We make the models, datasets, and a leaderboard publicly available at https://github.com/csebuetnlp/banglabert to advance Bangla NLP.
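As a usage note not taken from the abstract itself: the released checkpoints are intended to be fine-tuned on the BLUB tasks, and a minimal sketch of such a setup with Hugging Face Transformers is shown below. It assumes the model is distributed through the Hugging Face Hub under an identifier like `csebuetnlp/banglabert` (an assumption here; consult the linked GitHub repository for the actual release and recommended fine-tuning scripts).

```python
# Minimal fine-tuning setup sketch (assumption: checkpoint is published on the
# Hugging Face Hub as "csebuetnlp/banglabert"; see the GitHub repo for specifics).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "csebuetnlp/banglabert"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Classification head for a BLUB-style text classification task (e.g., 2 labels).
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Encode a Bangla sentence and run a forward pass.
inputs = tokenizer("এটি একটি উদাহরণ বাক্য।", return_tensors="pt")  # "This is an example sentence."
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])
```

The same pretrained encoder can be paired with token-classification or question-answering heads (e.g., `AutoModelForTokenClassification`, `AutoModelForQuestionAnswering`) for the sequence labeling and span prediction tasks in BLUB.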