The increase in abusive content on online social media platforms is impacting the social lives of online users. The use of offensive and hate speech has been making social media toxic. Homophobia and transphobia constitute offensive comments directed against the LGBT+ community. It is therefore imperative to detect and handle these comments, so that users indulging in such behaviour can be flagged or issued a warning in a timely manner. However, automated detection of such content is a challenging task, more so in Dravidian languages, which are identified as low-resource languages. Motivated by this, the paper explores the applicability of different deep learning models for classifying social media comments in Malayalam and Tamil as homophobic, transphobic, or non-anti-LGBT+ content. Popular deep learning models, namely a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) with GloVe embeddings, and transformer-based models (Multilingual BERT and IndicBERT), are applied to the classification problem. The results show that IndicBERT outperforms the other implemented models, with weighted average F1-scores of 0.86 and 0.77 for Malayalam and Tamil, respectively. The present work therefore confirms the superior performance of IndicBERT on the given task in the selected Dravidian languages.
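The evaluation metric reported above is the weighted average F1-score, i.e. the per-class F1 averaged with each class weighted by its support. As a minimal illustrative sketch (not the authors' evaluation code, and using hypothetical label names for the three classes), the metric can be computed as:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1, weighted by class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in support:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        score += f1 * support[c] / total
    return score

# Toy 3-class example with hypothetical labels (not data from the paper)
y_true = ["non-anti-LGBT+", "non-anti-LGBT+", "homophobic", "transphobic"]
y_pred = ["non-anti-LGBT+", "homophobic", "homophobic", "transphobic"]
print(weighted_f1(y_true, y_pred))  # 0.75
```

The same result is obtained from `sklearn.metrics.f1_score` with `average="weighted"`; the hand-rolled version above is shown only to make the weighting explicit.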