In this paper, we present HS-BAN, a binary-class hate speech (HS) dataset in the Bangla language consisting of more than 50,000 labeled comments, of which 40.17% are hate speech and the rest are non-hate speech. While preparing the dataset, a strict and detailed annotation guideline was followed to reduce human annotation bias. The dataset was also linguistically preprocessed to extract the different types of slang that people currently write using symbols, acronyms, or alternative spellings. These slang words were further categorized into traditional and non-traditional slang lists, which are included in the results of this paper. We explored traditional linguistic features and neural network-based methods to develop a benchmark system for hate speech detection in Bangla. Our experimental results show that existing word embedding models trained on informal text perform better than those trained on formal text. In our benchmark, a Bi-LSTM model on top of FastText informal word embeddings achieved an F1-score of 86.78%. We will make the dataset publicly available.