Models trained on large unlabeled corpora of human interactions will learn patterns and mimic behaviors therein, which include offensive or otherwise toxic behavior and unwanted biases. We investigate a variety of methods to mitigate these issues in the context of open-domain generative dialogue models. We introduce a new human-and-model-in-the-loop framework for both training safer models and evaluating them, as well as a novel method to distill safety considerations into generative models without requiring an external classifier at deployment time. We conduct experiments comparing these methods and find that our new techniques are (i) safer than existing models, as measured by automatic and human evaluations, while (ii) maintaining usability metrics such as engagingness relative to the state of the art. We then discuss the limitations of this work by analyzing failure cases of our models.