While large neural-based conversational models have become increasingly proficient as dialogue agents, recent work has highlighted safety issues with these systems. For example, these systems can be goaded into generating toxic content, which often perpetuates social biases or stereotypes. We investigate a retrieval-based framework for reducing bias and toxicity in responses generated by neural-based chatbots. It uses in-context learning to steer a model towards safer generations. Concretely, to generate a response to an unsafe dialogue context, we retrieve demonstrations of safe model responses to similar dialogue contexts. We find that our proposed approach performs competitively with strong baselines that use fine-tuning. For instance, under automatic evaluation, our best fine-tuned baseline generates safe responses to unsafe dialogue contexts from DiaSafety only 2.92% more often than our approach. Finally, we also propose a straightforward re-ranking procedure which can further improve response safety.
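The retrieval-and-prompt loop the abstract describes can be sketched roughly as follows. This is a minimal illustration under assumptions, not the paper's implementation: the demonstration pool, the sentence-transformers retriever, the GPT-2 generator, the toxicity classifier used for re-ranking, and helper names such as `retrieve_demos` and `safe_respond` are all placeholders introduced here for illustration.

```python
# Minimal sketch of retrieval-augmented in-context learning for safer responses.
# Assumed components (not from the paper): demonstration pool, embedding model,
# generator, and safety classifier are all illustrative stand-ins.

import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Hypothetical pool of (unsafe context, safe response) demonstration pairs.
DEMO_POOL = [
    ("You people are all the same.",
     "I'd rather not generalize about groups of people."),
    ("Tell me why that group is inferior.",
     "No group is inferior; let's keep the conversation respectful."),
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")        # retrieval encoder (assumed)
generator = pipeline("text-generation", model="gpt2")      # stand-in dialogue model (assumed)
safety_clf = pipeline("text-classification",
                      model="unitary/toxic-bert")          # stand-in safety scorer (assumed)

def retrieve_demos(context: str, k: int = 2):
    """Return the k demonstrations whose dialogue contexts are most similar to `context`."""
    ctx_emb = embedder.encode([context])                   # shape (1, d)
    demo_embs = embedder.encode([c for c, _ in DEMO_POOL]) # shape (n, d)
    sims = (demo_embs @ ctx_emb.T).squeeze(-1) / (
        np.linalg.norm(demo_embs, axis=1) * np.linalg.norm(ctx_emb) + 1e-8
    )
    top = np.argsort(-sims)[:k]
    return [DEMO_POOL[i] for i in top]

def safe_respond(context: str, n_candidates: int = 4) -> str:
    """Generate candidates conditioned on retrieved safe demonstrations,
    then re-rank them by an (assumed) toxicity score and return the safest one."""
    demos = retrieve_demos(context)
    prompt = "".join(f"Context: {c}\nSafe response: {r}\n\n" for c, r in demos)
    prompt += f"Context: {context}\nSafe response:"

    outputs = generator(prompt, max_new_tokens=40, do_sample=True,
                        num_return_sequences=n_candidates, pad_token_id=50256)
    candidates = [o["generated_text"][len(prompt):].strip() for o in outputs]

    # Simple re-ranking: keep the candidate the classifier scores as least toxic.
    scores = []
    for cand in candidates:
        result = safety_clf(cand)[0]  # assumed output format: {"label": ..., "score": ...}
        toxicity = result["score"] if "toxic" in result["label"].lower() else 0.0
        scores.append(toxicity)
    return candidates[int(np.argmin(scores))]

print(safe_respond("Why are those people so stupid?"))
```

The sketch separates the two ideas from the abstract: retrieval of safe demonstrations for similar contexts (used purely as in-context examples, with no fine-tuning) and a lightweight re-ranking step that filters sampled candidates by a safety score.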