With the COVID-19 pandemic continuing, hatred against Asians, especially Chinese people, is intensifying in countries outside Asia. There is an urgent need to detect and prevent hate speech towards Asians effectively. In this work, we first create COVID-HATE-2022, a dataset of 2,025 tweets fetched in early February 2022 and annotated according to specific criteria, and we present a comprehensive collection of the hate and non-hate scenarios that appear in the dataset. Second, we fine-tune the BERT model on relevant datasets and demonstrate several strategies for "cleaning" the tweets. Third, we investigate the performance of advanced fine-tuning strategies with various model-centric and data-centric approaches. We show that both kinds of approach generally improve performance, with the data-centric ones outperforming the model-centric ones, which demonstrates the feasibility and effectiveness of data-centric approaches for the associated tasks.
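The abstract does not say which "cleaning" steps are applied to the tweets. As a rough illustration only, the sketch below assumes a few common ones: dropping URLs, masking user mentions, unwrapping hashtags, and collapsing whitespace. The function name `clean_tweet`, the `@user` placeholder, and the regular expressions are hypothetical choices, not taken from the paper.

```python
import re

# Assumed cleaning rules; the paper's actual strategies may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Return a normalized copy of a tweet for use as classifier input."""
    text = URL_RE.sub(" ", text)            # drop links
    text = MENTION_RE.sub("@user", text)    # mask user handles
    text = HASHTAG_RE.sub(r"\1", text)      # keep the hashtag word, drop '#'
    return WHITESPACE_RE.sub(" ", text).strip()  # collapse whitespace


if __name__ == "__main__":
    print(clean_tweet("RT @bob: stay safe #COVID19 https://t.co/abc  "))
```

Whether each rule helps or hurts is an empirical question; for example, hashtags may carry signal that a classifier can use, so a variant that keeps them intact could be compared against this one.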