Chatbots are used in many applications, e.g., automated agents, smart home assistants, and interactive characters in online games. It is therefore crucial to ensure they do not behave in undesired ways, e.g., by providing offensive or toxic responses to users. This is not a trivial task, as state-of-the-art chatbot models are trained on large public datasets openly collected from the Internet. This paper presents a first-of-its-kind, large-scale measurement of toxicity in chatbots. We show that publicly available chatbots are prone to providing toxic responses when fed toxic queries. Even more worryingly, some non-toxic queries can trigger toxic responses too. We then set out to design and experiment with an attack, ToxicBuddy, which relies on fine-tuning GPT-2 to generate non-toxic queries that make chatbots respond in a toxic manner. Our extensive experimental evaluation demonstrates that our attack is effective against public chatbot models and outperforms manually crafted malicious queries proposed by previous work. We also evaluate three defense mechanisms against ToxicBuddy, showing that they either reduce the attack's effectiveness at the cost of degrading the chatbot's utility or only mitigate a portion of the attack. This highlights the need for more research from the computer security and online safety communities to ensure that chatbot models do not hurt their users. Overall, we are confident that ToxicBuddy can be used as an auditing tool and that our work will pave the way toward designing more effective defenses for chatbot safety.
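To make the fine-tune-then-filter idea concrete, below is a minimal, illustrative sketch written with the HuggingFace transformers library. It is not the authors' released implementation: the seed file seed_queries.txt, the training hyperparameters, the unitary/toxic-bert filter model, and the 0.5 threshold are all assumptions made for illustration.

```python
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    TextDataset,
    Trainer,
    TrainingArguments,
    pipeline,
)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Fine-tune GPT-2 on seed queries (one per line), e.g. queries that
# previously elicited toxic chatbot replies. "seed_queries.txt" is a
# hypothetical file name; block size and epochs are assumptions.
train_set = TextDataset(tokenizer=tokenizer,
                        file_path="seed_queries.txt",
                        block_size=64)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="toxicbuddy-gpt2",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    data_collator=collator,
    train_dataset=train_set,
).train()

# Sample candidate queries from the fine-tuned model, then keep only those
# that an off-the-shelf toxicity classifier does not flag as toxic;
# ToxicBuddy's actual filtering pipeline may differ.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

candidates = generator(tokenizer.bos_token, max_length=32,
                       num_return_sequences=20, do_sample=True, top_p=0.9)
non_toxic_queries = []
for c in candidates:
    text = c["generated_text"].replace(tokenizer.bos_token, "").strip()
    if not text:
        continue
    verdict = toxicity(text)[0]
    # Assumed decision rule: discard a candidate only if the classifier's
    # top label is "toxic" with score above 0.5.
    if not (verdict["label"] == "toxic" and verdict["score"] > 0.5):
        non_toxic_queries.append(text)

print(non_toxic_queries)
```

The resulting non_toxic_queries list would then be fed to a target chatbot to test whether its responses are toxic; the essential point, per the abstract, is that the generator is steered toward queries that are themselves non-toxic yet still elicit toxic replies.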