Undermining the impact of hateful content with informed and non-aggressive responses, called counter narratives, has emerged as a possible way to foster healthier online communities. Thus, some NLP studies have started addressing the task of counter narrative generation. Although such studies have made an effort to build hate speech / counter narrative (HS/CN) datasets for neural generation, the resulting datasets fall short in quality, quantity, or both. In this paper, we propose a novel human-in-the-loop data collection methodology in which a generative language model is refined iteratively: in each loop, the model is trained on the data collected in previous loops and then generates new training samples that experts review and/or post-edit. Our experiments comprised several loops, including dynamic variations. Results show that the methodology is scalable and facilitates diverse, novel, and cost-effective data collection. To our knowledge, the resulting dataset is the only expert-based multi-target HS/CN dataset available to the community.
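The iterative loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `run_hitl_collection`, the `LoopResult` record, and the three callables `fine_tune`, `generate`, and `expert_review` are hypothetical names introduced here, and the sketch assumes that a reviewer either returns an accepted (possibly post-edited) pair or rejects it by returning `None`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A (hate speech, counter narrative) pair. Hypothetical representation.
Pair = Tuple[str, str]


@dataclass
class LoopResult:
    loop: int             # 1-based loop index
    accepted: List[Pair]  # pairs the expert accepted in this loop


def run_hitl_collection(
    seed: List[Pair],
    fine_tune: Callable[[List[Pair]], object],
    generate: Callable[[object], List[Pair]],
    expert_review: Callable[[Pair], Optional[Pair]],
    n_loops: int,
) -> Tuple[List[Pair], List[LoopResult]]:
    """Grow a dataset by alternating model refinement and expert review.

    All three callables are placeholders (assumptions of this sketch):
    fine_tune trains a model on the current data, generate samples new
    candidate pairs, and expert_review returns a validated or post-edited
    pair, or None to reject the candidate.
    """
    training_data = list(seed)
    seen = set(training_data)
    history: List[LoopResult] = []

    for loop in range(1, n_loops + 1):
        # Refine the model on everything collected so far.
        model = fine_tune(training_data)
        # Let the refined model propose new candidate pairs.
        candidates = generate(model)

        accepted: List[Pair] = []
        for candidate in candidates:
            reviewed = expert_review(candidate)
            # Keep only accepted pairs that are not already in the dataset.
            if reviewed is not None and reviewed not in seen:
                seen.add(reviewed)
                accepted.append(reviewed)

        # Accepted pairs become training data for the next loop.
        training_data.extend(accepted)
        history.append(LoopResult(loop=loop, accepted=accepted))

    return training_data, history
```

In this sketch the dataset only grows through pairs that pass expert review, so the model in each loop is trained exclusively on seed data plus validated output from earlier loops. The "dynamic variations" mentioned above are not modelled here.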