Current research on hate speech analysis typically targets monolingual, single-task classification. In this paper, we present a new multilingual hate speech analysis dataset covering English, Hindi, Arabic, French, German and Spanish across five hate speech domains: Abuse, Racism, Sexism, Religious Hate and Extremism. To the best of our knowledge, this paper is the first to address the identification of various types of hate speech across these five broad domains in these six languages. We describe how we created the dataset, how we annotated it at a high level and a low level for the different domains, and how we use it to test current state-of-the-art multilingual and multitask learning approaches. We evaluate our dataset in monolingual, cross-lingual and machine translation classification settings and compare it against open source English datasets that we aggregated and merged for this task. Finally, we discuss how this approach can be used to create large-scale hate speech datasets and how our annotations can be leveraged to improve hate speech detection and classification in general.