Detecting online hate is a complex task, and low-performing detection models have harmful consequences when used for sensitive applications such as content moderation. Emoji-based hate is a key emerging challenge for online hate detection. We present HatemojiCheck, a test suite of 3,930 short-form statements that allows us to evaluate how detection models perform on hateful language expressed with emoji. Using the test suite, we expose weaknesses in existing hate detection models. To address these weaknesses, we create the HatemojiTrain dataset using an innovative human-and-model-in-the-loop approach. Models trained on these 5,912 adversarial examples perform substantially better at detecting emoji-based hate, while retaining strong performance on text-only hate. Both HatemojiCheck and HatemojiTrain are made publicly available.