This article presents Lisan, a set of morphologically annotated corpora for the Yemeni, Sudanese, Iraqi, and Libyan Arabic dialects. Lisan contains around 1.2 million tokens. We collected the content of the corpora from several social media platforms. The Yemeni corpus (~1.05M tokens) was collected automatically from Twitter. The corpora of the other three dialects (~50K tokens each) were collected manually from Facebook and YouTube posts and comments. Thirty-five (35) annotators, all native speakers of the target dialects, carried out the annotations. The annotators segmented all words in the four corpora into prefixes, stems, and suffixes and labeled each with morphological features such as part of speech, lemma, and an English gloss. An Arabic Dialect Annotation Toolkit (ADAT) was developed to assist the annotators and to ensure compatibility with the SAMA and Curras tagsets. The annotators were trained on a set of guidelines and on how to use ADAT. The tool is open source, and the four corpora are also available online.