Unrestricted freedom of expression on social media has its costs, especially the spread of harmful and abusive content that may induce people to act accordingly. Automatically detecting such content has therefore become an urgent task, one that can make efforts to limit this toxic spread more efficient. Compared to other Arabic dialects, which are mostly based on Modern Standard Arabic (MSA), the Tunisian dialect combines several languages, including MSA, Tamazight, Italian and French. This linguistic richness, together with the lack of large annotated datasets, makes NLP tasks on Tunisian text challenging. In this paper we introduce a new annotated dataset of approximately 10k comments. We provide an in-depth exploration of its vocabulary through feature engineering approaches, and we report the classification performance of machine learning classifiers such as Naive Bayes (NB) and SVM and of deep learning models such as ARBERT, MARBERT and XLM-R.