The Arabic language is a morphologically rich language with relatively few resources and a less explored syntax compared to English. Given these limitations, Arabic Natural Language Processing (NLP) tasks such as Sentiment Analysis (SA), Named Entity Recognition (NER), and Question Answering (QA) have proven very challenging to tackle. Recently, with the surge of transformer-based models, language-specific BERT-based models have proven to be very efficient at language understanding, provided they are pre-trained on a very large corpus. Such models were able to set new standards and achieve state-of-the-art results on most NLP tasks. In this paper, we pre-trained BERT specifically for the Arabic language, in pursuit of the same success that BERT achieved for the English language. The performance of AraBERT is compared to multilingual BERT from Google and to other state-of-the-art approaches. The results show that the newly developed AraBERT achieves state-of-the-art performance on most of the tested Arabic NLP tasks. The pretrained AraBERT models are publicly available at https://github.com/aub-mind/arabert, in the hope of encouraging research and applications for Arabic NLP.
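As a brief illustration of how the released checkpoints can be used, the following is a minimal sketch that loads a pretrained AraBERT model with the Hugging Face transformers library. The model identifier "aubmindlab/bert-base-arabert" is an assumption about how the public checkpoint is published; the exact names are documented in the repository linked above.

```python
from transformers import AutoTokenizer, AutoModel

# Assumed model identifier; see https://github.com/aub-mind/arabert for the released names.
MODEL_NAME = "aubmindlab/bert-base-arabert"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

# Encode an Arabic sentence and obtain contextual token embeddings.
inputs = tokenizer("جملة عربية للتجربة", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```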