Large pretrained language models can easily produce toxic or biased content, which prohibits their practical use. To detect such toxic generations, existing methods rely on templates, real-world data extraction, crowdsourced workers, or automatic generation to construct adversarial contexts that are likely to elicit toxic responses. However, what types of contexts are more likely to induce unsafe responses remains under-explored. In this paper, we identify context toxicity and context category (e.g., \textit{profanity}, \textit{insult}, \textit{drugs}, etc.) as two important factors that cause safety issues in response generation. Hence, we propose a method called \emph{reverse generation} to construct adversarial contexts conditioned on a given response, with the flexibility to control the category, toxicity level, and inductivity of the generated contexts. Via reverse generation, we augment the existing BAD dataset and construct a new dataset, BAD+, which contains more than 120K diverse and highly inductive contexts in 12 categories. We test three popular pretrained dialogue models (Blender, DialoGPT, and Plato2) and find that BAD+ can largely expose their safety problems. Furthermore, we show that BAD+ can greatly enhance the safety of generation and reveal the key factors behind the improvement. Our code and dataset are available at \url{https://github.com/thu-coai/Reverse_Generation}.
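To illustrate the idea, the following is a minimal sketch of how reverse generation could be realized with an off-the-shelf seq2seq model: a generator is assumed to be fine-tuned to map a target response, prefixed with control tokens for category and toxicity level, back to a candidate adversarial context. The model name \texttt{facebook/bart-base}, the control-token format, and the function \texttt{reverse\_generate} are illustrative placeholders rather than the released implementation.

\begin{verbatim}
# A minimal sketch of reverse generation (assumptions noted above):
# a seq2seq model is assumed to be fine-tuned to map a response, plus
# control tokens for category and toxicity level, to an adversarial context.
from transformers import BartForConditionalGeneration, BartTokenizer

MODEL_NAME = "facebook/bart-base"  # placeholder for a fine-tuned checkpoint

tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

def reverse_generate(response, category="insult", toxic=True, num_contexts=3):
    """Generate candidate contexts conditioned on a given response.

    The control prefix below is a hypothetical format; in practice the
    control tokens would be registered as special tokens during fine-tuning.
    """
    prefix = f"<category> {category} <toxicity> {'toxic' if toxic else 'safe'} "
    inputs = tokenizer(prefix + response, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        do_sample=True,        # sampling encourages diverse contexts
        top_p=0.9,
        max_new_tokens=40,
        num_return_sequences=num_contexts,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

if __name__ == "__main__":
    for ctx in reverse_generate("I hate people like you."):
        print(ctx)
\end{verbatim}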