Processing the Arabic language is a complex field of research. Several factors contribute: Arabic's rich and complex morphology, its high degree of ambiguity, and the presence of several regional varieties, each of which must be processed with its own characteristics in mind. When its dialects are taken into account, Arabic pushes the limits of NLP to find solutions to the problems posed by its inherent nature. It is a diglossic language: the standard language is used in formal settings and in education, and it differs considerably from the vernaculars spoken in the different regions, which were influenced by older languages historically spoken there. This should encourage NLP specialists to create dialect-specific corpora such as Curras, the morphologically annotated Palestinian corpus of Birzeit University. In this work, we present the Lebanese corpus Baladi, which consists of around 9.6K morphologically annotated tokens. Since the Lebanese and Palestinian dialects belong to the same Levantine dialectal continuum and are thus highly mutually intelligible, our proposed corpus was constructed to (1) enrich Curras and transform it into a more general Levantine corpus and (2) improve Curras by correcting detected errors.