With the exponential rise in user-generated content on social media, abusive language directed at individuals or groups is also spreading rapidly across the internet. Identifying and filtering out such offensive content is very challenging for human moderators. Deep neural networks have shown promise, achieving reasonable accuracy for hate speech detection and allied applications. However, these classifiers depend heavily on the size and quality of the training data, and large, high-quality datasets are not easy to obtain. Moreover, the datasets that have emerged recently were not created under the same annotation guidelines and often cover different types and sub-types of hate. To address this data sparsity problem and to obtain more globally representative features, we propose Convolutional Neural Network (CNN) based multi-task learning models (MTLs)\footnote{Code is available at https://github.com/imprasshant/STL-MTL} that leverage information from multiple sources. Empirical analysis on three benchmark datasets shows the efficacy of the proposed approach, with significant improvements in accuracy and F-score that yield state-of-the-art performance relative to existing systems.