The social media platform is a convenient medium to express personal thoughts and share useful information. It is fast, concise, and has the ability to reach millions. It is an effective place to archive thoughts, share artistic content, receive feedback, promote products, etc. Despite having numerous advantages these platforms have given a boost to hostile posts. Hate speech and derogatory remarks are being posted for personal satisfaction or political gain. The hostile posts can have a bullying effect rendering the entire platform experience hostile. Therefore detection of hostile posts is important to maintain social media hygiene. The problem is more pronounced languages like Hindi which are low in resources. In this work, we present approaches for hostile text detection in the Hindi language. The proposed approaches are evaluated on the Constraint@AAAI 2021 Hindi hostility detection dataset. The dataset consists of hostile and non-hostile texts collected from social media platforms. The hostile posts are further segregated into overlapping classes of fake, offensive, hate, and defamation. We evaluate a host of deep learning approaches based on CNN, LSTM, and BERT for this multi-label classification problem. The pre-trained Hindi fast text word embeddings by IndicNLP and Facebook are used in conjunction with CNN and LSTM models. Two variations of pre-trained multilingual transformer language models mBERT and IndicBERT are used. We show that the performance of BERT based models is best. Moreover, CNN and LSTM models also perform competitively with BERT based models.
翻译:社交媒体平台是表达个人思想和分享有用信息的方便媒体。 它既快速、简洁,又有能力达到数百万人。 它是一个将思想归档、分享艺术内容、接受反馈、推广产品等的有效场所。 尽管这些平台具有诸多优势,但是这些平台还是刺激了敌对的话题。 仇恨言论和贬损性言论被张贴是为了个人满意或政治利益。 敌对职位可能会产生欺凌效应,导致整个平台出现敌意。 因此, 发现敌对职位对于维护社交媒体卫生很重要。 问题在于印地语等资源较少的更显著语言。 在这项工作中,我们提出了印地语中敌对文本检测的方法。 提议的方法在Cstraint@AAI 2021印地语敌对性检测数据集上得到评估。 数据集由从社交媒体平台收集的敌对和非敌对性言论组成。 敌对性言论可能会进一步被分割成一系列相互重叠的假、攻击、仇恨和诽谤。 我们评估了一组基于CNN、LSTM和BER的深层次语言快速文字字词, 以IMER的模型和MERM的模型用于最佳变型。