Hate speech is a global phenomenon, but most hate speech datasets so far focus on English-language content. This hinders the development of more effective hate speech detection models in hundreds of languages spoken by billions across the world. More data is needed, but annotating hateful content is expensive, time-consuming and potentially harmful to annotators. To mitigate these issues, we explore data-efficient strategies for expanding hate speech detection into under-resourced languages. In a series of experiments with mono- and multilingual models across five non-English languages, we find that 1) a small amount of target-language fine-tuning data is needed to achieve strong performance, 2) the benefits of using more such data decrease exponentially, and 3) initial fine-tuning on readily-available English data can partially substitute target-language data and improve model generalisability. Based on these findings, we formulate actionable recommendations for hate speech detection in low-resource language settings.