Warning: this paper contains content that may be offensive or upsetting. Recent research in Natural Language Processing (NLP) has advanced the development of various toxicity detection models with the intention of identifying and mitigating toxic language in existing systems. Despite the abundance of research in this area, less attention has been given to adversarial attacks that force the system to generate toxic language, or to defenses against them. Existing work on generating such attacks either relies on human-generated attacks, which are costly and not scalable, or, in the case of automatic attacks, produces attack vectors that do not conform to human-like language and can therefore be detected using a language model loss. In this work, we propose attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency, while being effective and scalable, i.e., they can automatically trigger the system into generating toxic language. We then propose a defense mechanism against such attacks that not only mitigates the attack but also attempts to maintain the conversational flow. Through automatic and human evaluations, we show that our defense is effective at avoiding toxic language generation even against imperceptible toxicity triggers, while the generated language fits the conversation in terms of coherency and relevancy. Lastly, we establish the generalizability of such a defense mechanism on language generation models beyond conversational agents.