The promise of interaction between intelligent conversational agents and humans is that models can learn from such feedback in order to improve. Unfortunately, such exchanges in the wild will not always involve human utterances that are benign or of high quality, and will include a mixture of engaged users (helpers) and unengaged or even malicious users (trolls). In this work we study how to perform robust learning in such an environment. We introduce a benchmark evaluation, SafetyMix, which can evaluate methods that learn safe vs. toxic language in a variety of adversarial settings to test their robustness. We propose and analyze several mitigating learning algorithms that identify trolls either at the example level or at the user level. Our main finding is that user-based methods, which take into account that troll users will exhibit adversarial behavior across multiple examples, work best in a variety of settings on our benchmark. We then test these methods in a further real-life setting of conversations collected during deployment, and find similar results.
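To make the example-level vs. user-level distinction concrete, the following is a minimal Python sketch of the two filtering strategies; the data layout, per-example untrustworthiness scores, and thresholds are illustrative assumptions, not the paper's exact algorithms. It contrasts dropping individual examples by their own score against aggregating scores per user and dropping all data from users flagged as trolls.

```python
# Sketch (under assumed scores/thresholds) of example-level vs. user-level troll filtering.
from collections import defaultdict
from statistics import mean

# Hypothetical data: (user_id, example_id, untrustworthiness score in [0, 1]),
# e.g. produced by a toxicity or safety classifier.
examples = [
    ("u1", 0, 0.9), ("u1", 1, 0.8), ("u1", 2, 0.4),   # likely troll, one mild example
    ("u2", 3, 0.1), ("u2", 4, 0.9), ("u2", 5, 0.2),   # helper with one noisy example
]

def example_level_filter(examples, threshold=0.5):
    """Drop each example independently if its own score exceeds the threshold."""
    return [(u, i, s) for (u, i, s) in examples if s <= threshold]

def user_level_filter(examples, threshold=0.5):
    """Aggregate scores per user and drop every example from flagged users.

    The intuition: trolls behave adversarially across many of their examples,
    so the per-user mean is a more robust signal than any single example.
    """
    per_user = defaultdict(list)
    for user, _, score in examples:
        per_user[user].append(score)
    trolls = {u for u, scores in per_user.items() if mean(scores) > threshold}
    return [(u, i, s) for (u, i, s) in examples if u not in trolls]

print(example_level_filter(examples))
# -> [('u1', 2, 0.4), ('u2', 3, 0.1), ('u2', 5, 0.2)]
#    one troll example slips through and the helper's noisy example is lost
print(user_level_filter(examples))
# -> [('u2', 3, 0.1), ('u2', 4, 0.9), ('u2', 5, 0.2)]
#    all of the troll's data is removed; the helper keeps all examples
```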