In the current era of the internet, where social media platforms are easily accessible to everyone, people often have to deal with threats, identity attacks, hate, and bullying because of their association with a caste, creed, gender, or religion, or even because they accept or reject a notion. Existing work on hate speech detection primarily focuses on classifying individual comments as a sequence labeling task and often fails to consider the context of the conversation. That context often plays a substantial role in determining the author's intent and the sentiment behind a tweet. This paper describes the system proposed by team MIDAS-IIITD for HASOC 2021 subtask 2, one of the first shared tasks focusing on detecting hate speech in Hindi-English code-mixed conversations on Twitter. We approach this problem using neural networks, leveraging transformers' cross-lingual embeddings and fine-tuning them for low-resource hate speech classification in transliterated Hindi text. Our best-performing system, a hard-voting ensemble of Indic-BERT, XLM-RoBERTa, and Multilingual BERT, achieved a macro F1 score of 0.7253, placing us first on the overall leaderboard.
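Hard voting, as used by the ensemble named in the abstract, can be illustrated with a minimal sketch. It assumes each fine-tuned model has already produced one predicted label per example. The function name `hard_vote`, the label strings, the sample predictions, and the tie-breaking rule are all illustrative assumptions and are not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def hard_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-model label predictions by majority (hard) voting.

    predictions[m][i] is the label that model m assigned to example i.
    Assumption: ties are broken in favour of the earliest-listed model
    whose label is among the most frequent ones.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_examples = len(predictions[0])
    if any(len(p) != n_examples for p in predictions):
        raise ValueError("all models must predict the same number of examples")

    ensembled = []
    # zip(*predictions) yields, for each example, the tuple of votes
    # cast by the models in the order they were listed.
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first vote (in model order) that reached the top count.
        winner = next(label for label in votes if counts[label] == top)
        ensembled.append(winner)
    return ensembled


# Hypothetical predictions from three models on four examples.
# The label names "HOF" and "NONE" are placeholders for illustration.
indic_bert = ["HOF", "NONE", "HOF", "NONE"]
xlm_roberta = ["HOF", "HOF", "NONE", "NONE"]
mbert = ["NONE", "HOF", "HOF", "NONE"]

print(hard_vote([indic_bert, xlm_roberta, mbert]))
```

With three models and two classes a tie cannot occur, so the tie-breaking rule only matters if the sketch is reused with an even number of models or with more classes.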