Dialogue models trained on human conversations inadvertently learn to generate toxic responses. In addition to producing explicitly offensive utterances, these models can also implicitly insult a group or individual by aligning themselves with an offensive statement. To better understand the dynamics of contextually offensive language, we investigate the stance of dialogue model responses in offensive Reddit conversations. Specifically, we create ToxiChat, a crowd-annotated dataset of 2,000 Reddit threads and model responses labeled with offensive language and stance. Our analysis reveals that 42% of human responses agree with toxic comments, whereas only 13% agree with safe comments. This undesirable behavior is learned by neural dialogue models, such as DialoGPT, which we show are two times more likely to agree with offensive comments than with safe ones. To enable automatic detection of offensive language, we fine-tune transformer-based classifiers on ToxiChat that achieve 0.71 F1 for offensive labels and 0.53 Macro-F1 for stance labels. Finally, we quantify the effectiveness of controllable text generation (CTG) methods in mitigating the tendency of neural dialogue models to agree with offensive comments. Compared to the baseline, our best CTG model achieves a 19% reduction in agreement with offensive comments and produces 29% fewer offensive replies. Our work highlights the need for further efforts to characterize and analyze inappropriate behavior in dialogue models to help make them safer. Our code and corpus are available at https://github.com/abaheti95/ToxiChat.