Toxic language detection systems often falsely flag text that mentions minority groups as toxic, because those groups are frequent targets of online hate. Such over-reliance on spurious correlations also causes systems to struggle to detect implicitly toxic language. To help mitigate these issues, we create ToxiGen, a new large-scale, machine-generated dataset of 274k toxic and benign statements about 13 minority groups. We develop a demonstration-based prompting framework and an adversarial classifier-in-the-loop decoding method to generate subtly toxic and benign text with a massive pretrained language model. Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale, and about more demographic groups, than previous resources of human-written text. We conduct a human evaluation on a challenging subset of ToxiGen and find that annotators struggle to distinguish machine-generated text from human-written language. We also find that 94.5% of toxic examples are labeled as hate speech by human annotators. Using three publicly available datasets, we show that fine-tuning a toxicity classifier on our data substantially improves its performance on human-written data. We also demonstrate that ToxiGen can be used to combat machine-generated toxicity, as fine-tuning significantly improves the classifier on our evaluation subset. Our code and data can be found at https://github.com/microsoft/ToxiGen.
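To make the classifier-in-the-loop idea concrete, below is a minimal, hypothetical sketch: at each decoding step, a language model's top-k next-token candidates are re-ranked by combining the LM log-probability with a toxicity classifier's score on the extended sequence, steering generation toward benign (or subtly toxic) continuations. The `next_token_logprobs` and `toxicity_score` callables, the parameter names, and the greedy re-ranking are illustrative assumptions, not the paper's exact procedure (which pairs the classifier with beam search over a large pretrained LM).

```python
import math
from typing import Callable, List, Tuple

def classifier_in_the_loop_decode(
    next_token_logprobs: Callable[[List[str]], List[Tuple[str, float]]],
    toxicity_score: Callable[[List[str]], float],
    prompt: List[str],
    max_new_tokens: int = 20,
    top_k: int = 10,
    alpha: float = 1.0,
    steer_toward_toxic: bool = False,
) -> List[str]:
    """Greedy sketch of classifier-guided decoding: re-rank the LM's
    top-k candidate tokens with a toxicity classifier's score."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Top-k next-token candidates from the LM, as (token, log-prob) pairs
        # assumed sorted by probability.
        candidates = next_token_logprobs(tokens)[:top_k]
        if not candidates:
            break
        best_token, best_score = None, -math.inf
        for token, logprob in candidates:
            tox = toxicity_score(tokens + [token])  # classifier output in [0, 1]
            # Reward toxic continuations when steering toward toxicity,
            # benign ones otherwise; alpha trades fluency against steering.
            bonus = alpha * (tox if steer_toward_toxic else 1.0 - tox)
            score = logprob + bonus
            if score > best_score:
                best_token, best_score = token, score
        tokens.append(best_token)
    return tokens
```

In practice, the same loop can generate both halves of the data: flipping the steering direction yields benign statements about a group from toxic demonstrations, and subtly toxic ones that a standard classifier rates as benign, which is what makes the resulting examples adversarial.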