Recent advances in technology have boosted social media usage, which in turn has produced large volumes of user-generated data, including hateful and offensive speech. The language used on social media is often a combination of English and the regional native language. In India, Hindi is used predominantly and is often code-switched with English, giving rise to the Hinglish (Hindi+English) language. Various approaches have been proposed in the past to classify code-mixed Hinglish hate speech using machine learning and deep learning techniques. However, these techniques rely on recurrence or convolution mechanisms, which are computationally expensive and have high memory requirements. They also depend on complex data preprocessing, which makes them hard to maintain and poorly adaptable to changes in the data. The proposed work offers a much simpler approach that is not only on par with these complex networks but also exceeds their performance by using subword tokenization algorithms such as BPE and Unigram together with multi-head attention, achieving an accuracy of 87.41% and an F1 score of 0.851 on standard datasets. Efficient use of the BPE and Unigram algorithms helps handle the nonconventional Hinglish vocabulary, making the proposed technique simple, efficient, and sustainable for real-world use.
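As an illustration of the subword tokenization step described above, the sketch below trains BPE and Unigram tokenizers on a toy romanized Hinglish corpus using the HuggingFace tokenizers library. The corpus sentences, vocabulary size, and special tokens are illustrative assumptions, not values taken from the paper, and the sketch is only meant to show how subword units can absorb nonstandard Hinglish spellings.

```python
# Minimal sketch: training BPE and Unigram subword tokenizers on Hinglish text.
# The corpus, vocab_size, and special tokens below are illustrative assumptions.
from tokenizers import Tokenizer
from tokenizers.models import BPE, Unigram
from tokenizers.trainers import BpeTrainer, UnigramTrainer
from tokenizers.pre_tokenizers import Whitespace

# Toy code-mixed Hinglish corpus (romanized Hindi mixed with English).
corpus = [
    "yeh movie bahut acchi thi yaar",
    "tum log kitna bakwas karte ho",
    "aaj ka match dekha kya bro",
]

def train_subword_tokenizer(model, trainer, texts):
    """Train a subword tokenizer of the given model type on raw texts."""
    tokenizer = Tokenizer(model)
    tokenizer.pre_tokenizer = Whitespace()
    tokenizer.train_from_iterator(texts, trainer)
    return tokenizer

# BPE: iteratively merges frequent character pairs into subword units.
bpe_tok = train_subword_tokenizer(
    BPE(unk_token="[UNK]"),
    BpeTrainer(vocab_size=8000, special_tokens=["[UNK]", "[PAD]"]),
    corpus,
)

# Unigram: prunes a large seed vocabulary down to a probabilistic subword lexicon.
uni_tok = train_subword_tokenizer(
    Unigram(),
    UnigramTrainer(vocab_size=8000, unk_token="[UNK]",
                   special_tokens=["[UNK]", "[PAD]"]),
    corpus,
)

# Both tokenizers split unseen, nonstandard Hinglish spellings into known subwords,
# whose embeddings would then feed a multi-head attention classifier.
print(bpe_tok.encode("bahut bakwas movie thi").tokens)
print(uni_tok.encode("bahut bakwas movie thi").tokens)
```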