In this paper we present the final results of a project on Tunisian Arabic encoded in Arabizi, the Latin-based writing system used for digital conversations. The project led to the creation of two integrated yet independent resources: a corpus and an NLP tool that annotates the former with several levels of linguistic information (word classification, transliteration, tokenization, POS tagging, and lemmatization). We discuss our choices in terms of computational and linguistic methodology and the strategies adopted to improve our results. We report on the experiments performed in order to outline our research path. Finally, we explain why we believe in the potential of these resources for both computational and linguistic research.
Keywords: Tunisian Arabizi, Annotated Corpus, Neural Network Architecture