Detecting "toxic" language in internet content is a pressing social and technical challenge. In this work, we focus on PERSPECTIVE from Jigsaw, a state-of-the-art tool that promises to score the "toxicity" of text and whose recent model update claims impressive results (Lees et al., 2022). We seek to challenge certain normative claims about toxic language by proposing a new benchmark, Selected Adversarial SemanticS, or SASS. We evaluate PERSPECTIVE on SASS and compare it to low-effort alternatives, namely zero-shot and few-shot prompted GPT-3 models, in a binary classification setting. We find that PERSPECTIVE exhibits troubling shortcomings across a number of our toxicity categories. SASS provides a new tool for evaluating performance on toxic language that previously went undetected, while avoiding common normative pitfalls. Our work leads us to emphasize the importance of questioning the assumptions made by toxicity-detection tools already in deployment, in order to anticipate and prevent disparate harms.
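For context on the binary classification setting, the sketch below shows one way PERSPECTIVE's continuous toxicity score can be binarized. This is a minimal illustration, not the paper's evaluation harness: the `is_toxic` helper, the 0.5 decision threshold, and the placeholder API key are our assumptions, while the API call itself follows the public Perspective API Python client.

```python
# Minimal sketch: binarizing PERSPECTIVE's TOXICITY score.
# Assumptions: google-api-python-client installed, a valid API key,
# and an illustrative 0.5 threshold (not necessarily the paper's choice).
from googleapiclient import discovery

API_KEY = "YOUR_API_KEY"  # placeholder; obtain from Google Cloud

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def is_toxic(text: str, threshold: float = 0.5) -> bool:
    """Return True if PERSPECTIVE's TOXICITY score meets the assumed threshold."""
    response = client.comments().analyze(body={
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }).execute()
    score = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
    return score >= threshold

# Usage: classify a single example as toxic / non-toxic.
print(is_toxic("You are a wonderful person."))
```

A thresholded score like this is what makes a head-to-head comparison with zero-shot and few-shot GPT-3 prompting possible, since prompted models naturally emit a discrete toxic / non-toxic label.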