Social media platforms and online streaming services have spawned a new breed of Hate Speech (HS). Due to the massive amount of user-generated content on these sites, modern machine learning techniques have proven feasible and cost-effective for tackling this problem. However, training generalizable models requires linguistically diverse datasets that cover the different social contexts in which offensive language is typically used. In this paper, we identify the shortcomings of existing Bangla HS datasets and introduce BD-SHS, a large manually labeled dataset that includes HS in different social contexts. The labeling criteria were prepared following a hierarchical annotation process, which, to the best of our knowledge, is the first of its kind for Bangla HS. The dataset includes more than 50,200 offensive comments crawled from online social networking sites and is at least 60% larger than any existing Bangla HS dataset. We present benchmark results on our dataset by training different NLP models; the best model achieves an F1-score of 91.0%. In our experiments, we found that a word embedding trained exclusively on 1.47 million comments from social media and streaming sites consistently resulted in better HS detection models than other pre-trained embeddings. Our dataset and all accompanying code are publicly available at github.com/naurosromim/hate-speech-dataset-for-Bengali-social-media
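As a rough illustration of how such a domain-specific embedding could be trained on informal social-media text, the following is a minimal sketch using gensim's FastText implementation. The corpus file name (bangla_comments.txt) and all hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: training an informal-text word embedding on crawled
# Bangla comments with gensim's FastText. The corpus path and the
# hyperparameters below are assumptions for illustration only.
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# One whitespace-tokenized comment per line (hypothetical file).
corpus = LineSentence("bangla_comments.txt")

model = FastText(
    sentences=corpus,
    vector_size=300,   # embedding dimension
    window=5,          # context window size
    min_count=5,       # ignore very rare tokens
    sg=1,              # skip-gram training
    epochs=10,
    workers=4,
)

model.save("bd_informal_embedding.model")

# FastText's subword information yields vectors even for misspelled or
# out-of-vocabulary tokens, which are common in social-media comments.
vec = model.wv["খারাপ"]  # example Bangla token
```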