Reinforcement learning methods require the careful design of a reward function to obtain the desired action policy for a given task. In the absence of hand-crafted reward functions, prior work has proposed several methods that estimate rewards from expert demonstrations consisting of state trajectories and the corresponding actions. However, there are cases where complete or reliable action information cannot be obtained from expert demonstrations. We propose a novel reinforcement learning method in which the agent learns an internal model of observation from expert-demonstrated state trajectories to estimate rewards, without having to learn the dynamics of the external environment from state-action pairs. The internal model takes the form of a predictive model of the given expert state distribution. During reinforcement learning, the agent estimates the reward as a function of the difference between the actual state and the state predicted by the internal model. We conducted experiments in environments of varying complexity, including the Super Mario Bros and Flappy Bird games, and show that our method successfully trains good policies directly from expert game-play videos.
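To make the reward-estimation idea concrete, the following is a minimal sketch of how a reward could be derived from the prediction error of an internal model trained only on expert state trajectories. It assumes a simple linear one-step predictor; the class and function names (`InternalModel`, `estimated_reward`) and the least-squares fitting are illustrative stand-ins, not the paper's actual (learned, e.g. neural) predictive model.

```python
import numpy as np


class InternalModel:
    """Toy one-step predictor of the next state, fit on expert state
    trajectories only (no action information is used)."""

    def __init__(self, state_dim):
        self.W = np.zeros((state_dim, state_dim))

    def fit(self, expert_states):
        # expert_states: array of shape (T, state_dim) from expert demonstrations
        X, Y = expert_states[:-1], expert_states[1:]
        # least-squares fit of a linear transition s_{t+1} ~= s_t @ W
        self.W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    def predict(self, state):
        # predicted next state under the expert state distribution
        return state @ self.W


def estimated_reward(model, prev_state, actual_state, scale=1.0):
    """Reward is higher the closer the agent's actual state is to the state
    the internal model predicts from the previous state."""
    predicted = model.predict(prev_state)
    return -scale * float(np.linalg.norm(actual_state - predicted))
```

In use, the model would be fit once on expert state trajectories, and `estimated_reward` would then replace the hand-crafted reward signal during reinforcement learning of the agent's policy.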