How can we train a dialog model to produce better conversations by learning from human feedback, without the risk of humans teaching it harmful chat behaviors? We start by hosting models online and gathering human feedback from real-time, open-ended conversations, which we then use to train and improve the models via offline reinforcement learning (RL). We identify implicit conversational cues, including language similarity, elicitation of laughter, sentiment, and more, which indicate positive human feedback, and embed these in multiple reward functions. A well-known challenge is that learning an RL policy in an offline setting usually fails due to the lack of ability to explore and the tendency to make over-optimistic estimates of future reward. These problems become even harder when using RL for language models, which can easily have a 20,000-action vocabulary and many possible reward functions. We solve this challenge by developing a novel class of offline RL algorithms. These algorithms use KL-control to penalize divergence from a pre-trained prior language model, and use a new strategy to make the algorithm pessimistic, instead of optimistic, in the face of uncertainty. We test the resulting dialog model with ratings from 80 users in an open-domain setting and find it achieves significant improvements over existing deep offline RL approaches. The novel offline RL method is viable for improving any existing generative dialog model using a static dataset of human feedback.
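To make the two ingredients of the offline RL objective concrete, here is a minimal sketch of how a KL-control penalty toward a pre-trained prior language model and a pessimistic target estimate could be combined in a single Q-learning update. It assumes PyTorch, and all names (`q_net`, `target_q_nets`, `prior_logprobs`, `beta`) are illustrative assumptions rather than identifiers from the paper's code; the paper's actual implementation may differ, for example in how it forms the pessimistic lower bound.

```python
# Hedged sketch, not the authors' released code: KL-control toward a prior
# language model plus a pessimistic (minimum over target estimates) bootstrap
# in an offline Q-learning update over a token vocabulary.
import torch
import torch.nn.functional as F


def offline_q_loss(q_net, target_q_nets, prior_logprobs, obs, actions,
                   rewards, next_obs, beta=0.1, gamma=0.99):
    """One offline TD update for a dialog Q-function.

    Assumed shapes: q_net(obs) -> [batch, vocab] Q-values over next tokens;
    target_q_nets is a list of target copies of q_net; prior_logprobs are the
    pre-trained language model's log-probabilities for next_obs, [batch, vocab];
    actions are token indices, rewards are per-step scalar rewards.
    """
    with torch.no_grad():
        # Pessimism in the face of uncertainty: take the elementwise minimum
        # over several target estimates instead of a single optimistic one.
        target_q = torch.stack([t(next_obs) for t in target_q_nets]).min(dim=0).values

        # KL-control soft value: V(s') = beta * log sum_a p(a|s') exp(Q(s',a)/beta),
        # which penalizes the learned policy's divergence from the prior p.
        next_value = beta * torch.logsumexp(prior_logprobs + target_q / beta, dim=-1)

        td_target = rewards + gamma * next_value

    q_pred = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_pred, td_target)
```

The `logsumexp` term is the soft value function that arises when the per-step reward is augmented with a KL penalty against the prior, keeping the learned policy close to fluent language; taking the minimum over multiple target estimates replaces the usual max-based bootstrap, which tends to be over-optimistic on actions poorly covered by the static dataset.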