The increase in the number of social media users has led to many people misusing these platforms to spread offensive content and hate speech. Manually tracking the vast number of posts is impractical, so automated methods are needed to identify such content quickly. Large language models are trained on large amounts of data and make use of contextual embeddings, so we fine-tune them for our task. Because the data is also quite imbalanced, we use a modified cross-entropy loss to address this issue. We observe that a model fine-tuned on Hindi corpora performs better. Our team (HNLP) achieved macro F1-scores of 0.808 and 0.639 in English Subtask A and English Subtask B, respectively. For Hindi Subtask A and Hindi Subtask B, our team achieved macro F1-scores of 0.737 and 0.443, respectively, in HASOC 2021.
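The abstract mentions a modified cross-entropy loss for the class imbalance but does not specify its form; a minimal sketch, assuming a class-weighted cross-entropy in PyTorch with illustrative (not actual) label counts, is shown below.

```python
import torch
import torch.nn as nn

# Hypothetical class counts for a binary offensive/non-offensive task;
# the real HASOC 2021 label distribution would be used in practice.
class_counts = torch.tensor([3000.0, 800.0])  # [non-offensive, offensive]

# Inverse-frequency weights: the rarer class contributes more to the loss.
weights = class_counts.sum() / (len(class_counts) * class_counts)

# Weighted cross-entropy, one plausible "modified" variant.
criterion = nn.CrossEntropyLoss(weight=weights)

# logits: (batch_size, num_classes) from the fine-tuned transformer head
logits = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
loss = criterion(logits, labels)
```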