How can we train a dialog model to produce better conversations by learning from human feedback, without the risk of humans teaching it harmful chat behaviors? We start by hosting models online and gathering human feedback from real-time, open-ended conversations, which we then use to train and improve the models via offline reinforcement learning (RL). We identify implicit conversational cues, including language similarity, elicitation of laughter, sentiment, and more, which indicate positive human feedback, and embed these in multiple reward functions. A well-known challenge is that learning an RL policy in an offline setting usually fails due to the lack of ability to explore and the tendency to make over-optimistic estimates of future reward. These problems become even harder when using RL for language models, which can easily have a 20,000-action vocabulary and many possible reward functions. We solve this challenge by developing a novel class of offline RL algorithms. These algorithms use KL-control to penalize divergence from a pre-trained prior language model, and use a new strategy to make the algorithm pessimistic, instead of optimistic, in the face of uncertainty. We test the resulting dialog model with ratings from 80 users in an open-domain setting and find it achieves significant improvements over existing deep offline RL approaches. The novel offline RL method is viable for improving any existing generative dialog model using a static dataset of human feedback.
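To make the two ingredients of the offline RL objective concrete, here is a minimal sketch of how a KL-control penalty toward a pre-trained prior language model and a pessimistic target estimate could be combined in a single Q-learning update. It assumes PyTorch, and all names (`q_net`, `target_q_nets`, `prior_logprobs`, `beta`) are illustrative assumptions rather than identifiers from the paper's code; the paper's actual implementation may differ, for example in how it forms the pessimistic lower bound.

```python
# Hedged sketch, not the authors' released code: KL-control toward a prior
# language model plus a pessimistic (minimum over target estimates) bootstrap
# in an offline Q-learning update over a token vocabulary.
import torch
import torch.nn.functional as F


def offline_q_loss(q_net, target_q_nets, prior_logprobs, obs, actions,
                   rewards, next_obs, beta=0.1, gamma=0.99):
    """One offline TD update for a dialog Q-function.

    Assumed shapes: q_net(obs) -> [batch, vocab] Q-values over next tokens;
    target_q_nets is a list of target copies of q_net; prior_logprobs are the
    pre-trained language model's log-probabilities for next_obs, [batch, vocab];
    actions are token indices, rewards are per-step scalar rewards.
    """
    with torch.no_grad():
        # Pessimism in the face of uncertainty: take the elementwise minimum
        # over several target estimates instead of a single optimistic one.
        target_q = torch.stack([t(next_obs) for t in target_q_nets]).min(dim=0).values

        # KL-control soft value: V(s') = beta * log sum_a p(a|s') exp(Q(s',a)/beta),
        # which penalizes the learned policy's divergence from the prior p.
        next_value = beta * torch.logsumexp(prior_logprobs + target_q / beta, dim=-1)

        td_target = rewards + gamma * next_value

    q_pred = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_pred, td_target)
```

The `logsumexp` term is the soft value function that arises when the per-step reward is augmented with a KL penalty against the prior, keeping the learned policy close to fluent language; taking the minimum over multiple target estimates replaces the usual max-based bootstrap, which tends to be over-optimistic on actions poorly covered by the static dataset.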