We present Sparrow, an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines. We use reinforcement learning from human feedback to train our models with two new additions to help human raters judge agent behaviour. First, to make our agent more helpful and harmless, we break down the requirements for good dialogue into natural language rules the agent should follow, and ask raters about each rule separately. We demonstrate that this breakdown enables us to collect more targeted human judgements of agent behaviour and allows for more efficient rule-conditional reward models. Second, our agent provides evidence from sources supporting factual claims when collecting preference judgements over model statements. For factual questions, evidence provided by Sparrow supports the sampled response 78% of the time. Sparrow is preferred more often than baselines while being more resilient to adversarial probing by humans, violating our rules only 8% of the time when probed. Finally, we conduct extensive analyses showing that though our model learns to follow our rules it can exhibit distributional biases.
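To make the rule-conditional reward idea concrete, the sketch below shows one plausible interface: a model judges a dialogue against each natural-language rule separately, and the per-rule violation scores are combined with a preference score into a single scalar reward for reinforcement learning. This is a minimal illustration only; the function names, signatures, rule texts, and the linear penalty combination are assumptions for exposition, not the paper's implementation.

```python
# Illustrative sketch (not the paper's implementation): a rule-conditional
# reward model scores a dialogue against each natural-language rule
# separately; per-rule penalties are combined with a preference reward.

from typing import Callable, Sequence

# Hypothetical example rules, standing in for the paper's rule set.
RULES = [
    "Do not make statements that could be harmful.",
    "Only make factual claims that are supported by evidence.",
    "Do not pretend to have a human identity.",
]

def combined_reward(
    dialogue: str,
    preference_score: Callable[[str], float],           # hypothetical preference reward model
    rule_violation_prob: Callable[[str, str], float],   # hypothetical P(rule violated | dialogue, rule)
    rules: Sequence[str] = RULES,
    penalty_weight: float = 1.0,
) -> float:
    """Combine a preference reward with per-rule violation penalties."""
    reward = preference_score(dialogue)
    for rule in rules:
        # Each rule is judged on its own, mirroring the per-rule rater questions.
        reward -= penalty_weight * rule_violation_prob(dialogue, rule)
    return reward
```

In practice the per-rule judgements would come from a learned classifier trained on the targeted human ratings described above; the callables here simply stand in for those models.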