This paper describes neural models developed for the Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages Shared Task 2021. Our team, neuro-utmn-thales, participated in two tasks on binary and fine-grained classification of English tweets that contain hate, offensive, and profane content (English Subtasks A & B) and in one task on identification of problematic content in Marathi (Marathi Subtask A). For the English subtasks, we investigate the impact of additional hate speech corpora on fine-tuning transformer models. We also apply a one-vs-rest approach based on Twitter-RoBERTa to discriminate between hate, profane, and offensive posts. Our models ranked third in English Subtask A with an F1-score of 81.99% and second in English Subtask B with an F1-score of 65.77%. For the Marathi task, we propose a system based on Language-Agnostic BERT Sentence Embedding (LaBSE). This model achieved the second-best result in Marathi Subtask A, obtaining an F1-score of 88.08%.
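To make the two building blocks named above concrete, here is a minimal sketch of a one-vs-rest classifier for English Subtask B. It assumes the public Hugging Face checkpoint `cardiffnlp/twitter-roberta-base` and the subtask's HATE/OFFN/PRFN label set; the fine-tuning loop is elided, so this illustrates only the inference logic, not the authors' actual pipeline.

```python
# One-vs-rest sketch: one binary Twitter-RoBERTa head per fine-grained class.
# Assumptions (not from the paper): checkpoint name, label names, softmax scoring.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CLASSES = ["HATE", "OFFN", "PRFN"]          # fine-grained labels in Subtask B
CHECKPOINT = "cardiffnlp/twitter-roberta-base"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

def load_binary_model():
    # One binary head per class: label 1 means "tweet belongs to this class".
    return AutoModelForSequenceClassification.from_pretrained(
        CHECKPOINT, num_labels=2)

models = {label: load_binary_model() for label in CLASSES}
# ... fine-tune each models[label] on (tweet, is_label) pairs here ...

def predict(tweet: str) -> str:
    # Score the tweet with every binary classifier and keep the class
    # whose positive-class probability is highest.
    enc = tokenizer(tweet, return_tensors="pt", truncation=True)
    scores = {}
    with torch.no_grad():
        for label, model in models.items():
            logits = model(**enc).logits
            scores[label] = torch.softmax(logits, dim=-1)[0, 1].item()
    return max(scores, key=scores.get)
```

Along the same lines, a hedged sketch of a LaBSE-based system for Marathi Subtask A, assuming the `sentence-transformers/LaBSE` checkpoint and a simple logistic-regression head on top of the sentence embeddings; the classification layer used in the paper may differ.

```python
# LaBSE sketch: language-agnostic sentence vectors + a lightweight classifier.
# The training texts/labels below are placeholders, not data from the paper.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("sentence-transformers/LaBSE")

train_texts = ["example tweet one", "example tweet two"]  # placeholder Marathi tweets
train_labels = [0, 1]                                     # 0 = NOT, 1 = HOF (placeholder)

X = encoder.encode(train_texts)           # multilingual sentence embeddings
clf = LogisticRegression().fit(X, train_labels)

def is_problematic(tweet: str) -> bool:
    return bool(clf.predict(encoder.encode([tweet]))[0])
```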