Detecting online hate is a complex task, and low-performing models have harmful consequences when used for sensitive applications such as content moderation. Emoji-based hate is a key emerging challenge for automated detection. We present HatemojiCheck, a test suite of 3,930 short-form statements that allows us to evaluate performance on hateful language expressed with emoji. Using the test suite, we expose weaknesses in existing hate detection models. To address these weaknesses, we create the HatemojiTrain dataset using a human-and-model-in-the-loop approach. Models trained on these 5,912 adversarial examples perform substantially better at detecting emoji-based hate, while retaining strong performance on text-only hate. Both HatemojiCheck and HatemojiTrain are made publicly available.