Hate Speech Detection in the Bengali Language: A Dataset and Its Baseline Evaluation

Social media sites such as YouTube and Facebook have become an integral part of everyday life, and over the last few years hate speech in social media comment sections has increased rapidly. Detecting hate speech on social media faces a variety of challenges, including small, imbalanced datasets, finding an appropriate model, and choosing a feature analysis method. Furthermore, the problem is more severe for the Bengali-speaking community because gold-standard labelled datasets are lacking. This paper presents a new dataset of 30,000 user comments tagged by crowdsourcing and verified by experts. All comments were collected from YouTube and Facebook comment sections and classified into seven categories: sports, entertainment, religion, politics, crime, celebrity, and TikTok & meme. A total of 50 annotators took part; each comment was annotated three times, and the majority vote was taken as the final annotation. In addition, we conducted baseline experiments and evaluated several deep learning models, along with pre-trained Bengali word embeddings such as Word2Vec, FastText, and BengFastText, on this dataset to facilitate future research. The experiments showed that although all the deep learning models performed well, SVM achieved the best result, with 87.5% accuracy. Our core contribution is to make this benchmark dataset available and accessible to facilitate further research in the field of Bengali hate speech detection.
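
The majority-vote step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the comment IDs, the binary label scheme (1 = hate, 0 = not hate), and the function name `majority_label` are assumptions made for illustration only, since the abstract does not specify the label set or the aggregation implementation.

```python
from collections import Counter


def majority_label(labels):
    """Return the label that received the most votes for one comment."""
    if not labels:
        raise ValueError("at least one annotation is required")
    # most_common(1) returns a list holding the single (label, count) pair
    # with the highest count.
    label, _count = Counter(labels).most_common(1)[0]
    return label


# Hypothetical annotations: three votes per comment, binary labels assumed.
annotations = {
    "comment_1": [1, 1, 0],
    "comment_2": [0, 0, 0],
    "comment_3": [0, 1, 1],
}

final_labels = {cid: majority_label(votes) for cid, votes in annotations.items()}
print(final_labels)
```

With three votes over two labels a strict majority always exists; a scheme with more labels would also need a tie-breaking rule, which the abstract does not describe.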

Related content

Dataset

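A data set (or dataset) is a collection of data, usually presented in tabular form. Each column represents a particular variable, and each row corresponds to one member of the data set. A row lists the values of each variable for that member, such as the height and weight of an object. Each value is called a datum. The data set may contain data for one or more members, matching the number of rows.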

A Brief Tutorial on Neural ODEs (50-page slide deck) · August 2, 2020

Introduction to Linux (96-page slide deck) · July 26, 2020

[University of Tokyo and UCSB] A Survey on Natural Language Processing for Fake News Detection · February 12, 2020

[A large collection of multilingual BERT models] Transfer learning is increasingly going multilingual with language-specific BERT models · January 30, 2020