Toxicity is pervasive in social media and poses a major threat to the health of online communities. Pre-trained language models, which have achieved state-of-the-art results on many NLP tasks, have transformed how we approach natural language processing. However, because pre-training is task-agnostic, these models are unlikely to capture task-specific statistical information or domain-specific knowledge out of the box. Additionally, most implementations of these models do not employ a conditional random field (CRF), a method for jointly classifying an entire sequence of tokens rather than labeling each token independently. We show that these additions improve model performance on the Toxic Spans Detection task at SemEval-2021, achieving a score within 4 percentage points of the top-performing team.
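To illustrate the joint decoding that a CRF layer performs over token labels, here is a minimal sketch of Viterbi decoding over per-token emission scores plus label-transition scores. The labels, scores, and function name are illustrative assumptions, not values from the paper; a real system would learn these scores from data.

```python
def viterbi_decode(emissions, transitions):
    """Find the highest-scoring label sequence.

    emissions: list of {label: score} dicts, one per token.
    transitions: dict mapping (prev_label, label) -> score.
    """
    labels = list(emissions[0])
    # best[i][l] = (score, backpointer) of the best path that
    # ends at token i with label l
    best = [{l: (emissions[0][l], None) for l in labels}]
    for em in emissions[1:]:
        step = {}
        for l in labels:
            prev, score = max(
                ((p, best[-1][p][0] + transitions[(p, l)]) for p in labels),
                key=lambda t: t[1],
            )
            step[l] = (score + em[l], prev)
        best.append(step)
    # Backtrack from the best final label to recover the sequence.
    last = max(labels, key=lambda l: best[-1][l][0])
    path = [last]
    for step in reversed(best[1:]):
        path.append(step[path[-1]][1])
    return path[::-1]


# Hypothetical two-token example with labels "TOX" (toxic) and "O".
# Per-token argmax would label the first token "O", but transition
# scores that favor keeping the same label pull the joint decoding
# toward a contiguous toxic span.
ems = [{"O": 1.0, "TOX": 0.9}, {"O": 0.1, "TOX": 1.0}]
trs = {("O", "O"): 0.5, ("TOX", "TOX"): 0.5,
       ("O", "TOX"): -0.5, ("TOX", "O"): -0.5}
print(viterbi_decode(ems, trs))  # ['TOX', 'TOX']
```

This captures why a CRF can help span detection: neighboring labels inform each other, so isolated single-token flips inside a span are discouraged.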