Recent technological advancements have boosted social media usage, producing large volumes of user-generated data that include hateful and offensive speech. The language used on social media is often a mix of English and the region's native language. In India, Hindi is used predominantly and is often code-switched with English, giving rise to the Hinglish (Hindi + English) language. Various approaches have been proposed in the past to classify code-mixed Hinglish hate speech using machine learning and deep learning techniques. However, these techniques rely on recurrence or convolution mechanisms, which are computationally expensive and have high memory requirements. They also involve complex data preprocessing, which makes them brittle when the data changes. We propose a much simpler approach that is not only on par with these complex networks but exceeds their performance, combining subword tokenization algorithms such as BPE and Unigram with a multi-head attention-based technique to achieve an accuracy of 87.41% and an F1 score of 0.851 on standard datasets. Efficient use of the BPE and Unigram algorithms helps handle the non-conventional Hinglish vocabulary, making our technique simple, efficient, and sustainable for real-world use.
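To illustrate why subword tokenization suits Hinglish's non-conventional vocabulary, the sketch below implements the core BPE merge loop in plain Python. This is a minimal, self-contained toy (not the paper's actual tokenizer): starting from character-level symbols, it repeatedly merges the most frequent adjacent pair, so frequent Hinglish fragments (here the invented example words "bahut" and "bakwaas" sharing the prefix "ba") become single subword units that generalize across spelling variants.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a list of words.

    Each word starts as a tuple of characters; every iteration merges
    the most frequent adjacent symbol pair across the whole corpus.
    Returns the learned merge rules and the final segmented vocabulary.
    """
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Re-segment every word, fusing occurrences of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Toy Hinglish-style corpus: the shared prefix "ba" is merged first.
merges, vocab = bpe_merges(["bahut", "bahut", "bakwaas", "bakwaas", "bakwaas"], 1)
```

In practice a library such as SentencePiece would be used to train BPE or Unigram models at scale; the point here is only that frequent character sequences in romanized Hindi are captured as units without any language-specific preprocessing.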