In recent years, monitoring hate speech and offensive language on social media platforms has become paramount due to their widespread use across all age groups, races, and ethnicities. Consequently, there have been substantial research efforts toward automated detection of such content using Natural Language Processing (NLP). However, while these systems successfully filter textual data, no research has focused on detecting hateful content in multimedia data. With the increasing ease of data storage and the exponential growth of social media platforms, multimedia content now proliferates on the internet as much as text data, yet it escapes automatic filtering systems. Hate speech and offensiveness can be detected in multimedia primarily via three modalities: visual, acoustic, and verbal. Our preliminary study concluded that the most essential features for classifying hate speech are the speaker's emotional state and its influence on the spoken words; we therefore limit the current research to the acoustic and verbal modalities. This paper proposes the first multimodal deep learning framework that combines auditory features representing emotion with semantic features to detect hateful content. Our results demonstrate that incorporating emotional attributes yields a significant improvement over text-based models in detecting hateful multimedia content. This paper also presents a new Hate Speech Detection Video Dataset (HSDVD), collected for the purpose of multimodal learning, as no such dataset exists today.
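Since the abstract does not specify the fusion architecture, the following is a minimal sketch of one plausible design: a late-fusion classifier that concatenates pre-extracted acoustic emotion features with text embeddings before classification. All module names, feature dimensions, and hyperparameters here are hypothetical illustrations, not the paper's actual implementation.

```python
# Minimal late-fusion sketch (hypothetical; the paper's exact architecture
# is not given in the abstract). Assumes per-utterance acoustic emotion
# features and semantic text embeddings have already been extracted.
import torch
import torch.nn as nn

class MultimodalHateClassifier(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, hidden_dim=256):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Fuse by concatenation, then classify hateful vs. non-hateful.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, audio_feats, text_feats):
        fused = torch.cat(
            [self.audio_enc(audio_feats), self.text_enc(text_feats)], dim=-1
        )
        return self.classifier(fused)  # logits over {non-hateful, hateful}

# Usage with a dummy batch of 4 utterances:
model = MultimodalHateClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 768))
```

Concatenation-based late fusion is only one option; attention-based or early-fusion schemes are equally compatible with the abstract's description.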