Reinforcement learning methods require the careful design of a reward function to obtain the desired action policy for a given task. In the absence of hand-crafted reward functions, prior work has proposed several methods that estimate rewards from expert demonstrations consisting of state trajectories and the corresponding actions. However, there are cases where complete or reliable action information cannot be obtained from expert demonstrations. We propose a novel reinforcement learning method in which the agent learns an internal model of observation from expert-demonstrated state trajectories to estimate rewards, without having to learn the dynamics of the external environment from state-action pairs. The internal model takes the form of a predictive model of the given expert state distribution. During reinforcement learning, the agent estimates the reward as a function of the difference between the actual state and the state predicted by the internal model. We conducted experiments in environments of varying complexity, including the Super Mario Bros and Flappy Bird games, and show that our method successfully trains good policies directly from expert game-play videos.
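To make the reward-estimation idea concrete, the following is a minimal sketch of how a reward could be derived from the prediction error of an internal model trained only on expert state trajectories. It assumes a simple linear one-step predictor; the class and function names (`InternalModel`, `estimated_reward`) and the least-squares fitting are illustrative stand-ins, not the paper's actual (learned, e.g. neural) predictive model.

```python
import numpy as np


class InternalModel:
    """Toy one-step predictor of the next state, fit on expert state
    trajectories only (no action information is used)."""

    def __init__(self, state_dim):
        self.W = np.zeros((state_dim, state_dim))

    def fit(self, expert_states):
        # expert_states: array of shape (T, state_dim) from expert demonstrations
        X, Y = expert_states[:-1], expert_states[1:]
        # least-squares fit of a linear transition s_{t+1} ~= s_t @ W
        self.W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    def predict(self, state):
        # predicted next state under the expert state distribution
        return state @ self.W


def estimated_reward(model, prev_state, actual_state, scale=1.0):
    """Reward is higher the closer the agent's actual state is to the state
    the internal model predicts from the previous state."""
    predicted = model.predict(prev_state)
    return -scale * float(np.linalg.norm(actual_state - predicted))
```

In use, the model would be fit once on expert state trajectories, and `estimated_reward` would then replace the hand-crafted reward signal during reinforcement learning of the agent's policy.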