The ever-growing use of social media in recent years has directly contributed to a rise in hate speech and offensive speech on online platforms. Research on effective detection of such content has mainly focused on English and a few other widely spoken languages, while most remaining languages have received far less attention and thus cannot benefit from the steady advances made in the field. In this paper we present \textsc{Shaj}, an annotated Albanian dataset for hate speech and offensive speech, constructed from user-generated content on various social media platforms. Its annotation follows the hierarchical schema introduced in OffensEval. The dataset is evaluated using three different classification models, the best of which achieves an F1 score of 0.77 for offensive language identification, 0.64 for the automatic categorization of offensive types, and 0.52 for offensive language target identification.