Social media platforms have recently helped people connect and communicate with a wider audience, but they have also led to a drastic increase in cyberbullying. Detecting and curbing hate speech is essential to keeping these platforms healthy. Moreover, code-mixed text containing more than one language is frequently used on these platforms. We therefore propose automated techniques for hate speech detection in code-mixed text scraped from Twitter, focusing specifically on code-mixed English-Hindi text and transformer-based approaches. While conventional approaches analyze each text in isolation, we also make use of context in the form of parent tweets. We evaluate the performance of multilingual BERT and Indic-BERT in single-encoder and dual-encoder settings. The first approach concatenates the target text and the context text with a separator token and obtains a single representation from the BERT model. The second approach encodes the two texts independently with a dual BERT encoder and averages the corresponding representations. We show that the dual-encoder approach using independent representations yields better performance. We also employ simple ensemble methods to improve performance further. With these methods we achieve a best F1 score of 73.07% on the HASOC 2021 ICHCL code-mixed data set.
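The two context-handling strategies described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `encode` below is a hypothetical stand-in for a BERT forward pass that maps a text to a fixed-size representation, and the token layout and vector shapes are illustrative assumptions.

```python
# Sketch of the two strategies for combining a target tweet with its
# parent-tweet context. `encode` is a hypothetical placeholder for a
# BERT encoder producing a fixed-size sentence representation.

def build_single_encoder_input(target: str, context: str) -> str:
    # Strategy 1 (single encoder): concatenate target and context with a
    # separator token and pass the pair through one BERT model, yielding
    # a single joint representation.
    return f"[CLS] {target} [SEP] {context} [SEP]"

def dual_encoder_representation(target_vec: list, context_vec: list) -> list:
    # Strategy 2 (dual encoder): encode the two texts independently and
    # average the resulting representations element-wise.
    return [(t + c) / 2 for t, c in zip(target_vec, context_vec)]
```

In a real pipeline the paired string of strategy 1 would go through a tokenizer rather than manual string concatenation, and the vectors of strategy 2 would come from two encoder forward passes; the averaging step itself is the simple element-wise mean shown here.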