Standard approaches to hate speech detection rely on sufficient available hate speech annotations. Extending previous work that repurposes natural language inference (NLI) models for zero-shot text classification, we propose a simple approach that combines multiple hypotheses to improve English NLI-based zero-shot hate speech detection. We first conduct an error analysis for vanilla NLI-based zero-shot hate speech detection and then develop four strategies based on this analysis. The strategies use multiple hypotheses to predict various aspects of an input text and combine these predictions into a final verdict. We find that the zero-shot baseline used for the initial error analysis already outperforms commercial systems and fine-tuned BERT-based hate speech detection models on HateCheck. The combination of the proposed strategies further increases the zero-shot accuracy of 79.4% on HateCheck by 7.9 percentage points (pp), and the accuracy of 69.6% on ETHOS by 10.0pp.
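To make the NLI-based setup concrete, below is a minimal Python sketch of zero-shot hate speech detection with multiple hypotheses, using the Hugging Face zero-shot-classification pipeline. The model checkpoint, the specific hypotheses, and the simple combination rule are illustrative assumptions for exposition; they are not the paper's exact four strategies or evaluation setup.

```python
from transformers import pipeline

# An MNLI-trained model repurposed for zero-shot classification.
# The checkpoint is an assumption; the paper's backbone may differ.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Illustrative hypotheses probing different aspects of the input text.
# Hypotheses supporting a hateful verdict:
SUPPORT_HYPOTHESES = [
    "This text contains hate speech.",
    "This text attacks a group of people.",
]
# Hypotheses capturing counter-evidence (e.g., quoted or referenced hate speech):
COUNTER_HYPOTHESES = [
    "This text quotes or references hate speech without endorsing it.",
]

def detect_hate_speech(text: str, threshold: float = 0.5) -> bool:
    """Score each hypothesis independently and combine the scores into a final verdict."""
    result = classifier(
        text,
        candidate_labels=SUPPORT_HYPOTHESES + COUNTER_HYPOTHESES,
        hypothesis_template="{}",  # use each label verbatim as the NLI hypothesis
        multi_label=True,          # score hypotheses independently
    )
    scores = dict(zip(result["labels"], result["scores"]))
    supports = any(scores[h] > threshold for h in SUPPORT_HYPOTHESES)
    counters = any(scores[h] > threshold for h in COUNTER_HYPOTHESES)
    # Simple illustrative combination rule: hateful only if supporting
    # evidence is present and no counter-evidence hypothesis fires.
    return supports and not counters

print(detect_hate_speech("I hate all members of that group."))
```

In this sketch, each hypothesis is scored as an independent entailment decision against the input text, and a hand-written rule aggregates the per-hypothesis predictions; the paper's strategies differ in which aspects they probe and how the predictions are combined.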