Social networking platforms provide a conduit for disseminating ideas, views, and thoughts and for spreading information. This has led to the amalgamation of English with natively spoken languages. Hindi-English code-mixed text (Hinglish) is increasingly prevalent among urban populations around the world. Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages. Thus, the worldwide hate speech detection rate of around 44% drops even further when content in Indian colloquial languages and slang is considered. In this paper, we propose a methodology for efficient detection of hate speech in unstructured code-mixed Hinglish text. We employ fine-tuning-based approaches for Hindi-English code-mixed language, using contextual embeddings such as ELMo (Embeddings from Language Models), FLAIR, and the transformer-based BERT (Bidirectional Encoder Representations from Transformers). We compare our proposed approach against pre-existing methods across various datasets. Our model outperforms the other methods and frameworks.