Transfer reinforcement learning aims to improve the sample efficiency of solving unseen new tasks by leveraging experience obtained from previous tasks. We consider the setting where all tasks (MDPs) share the same environment dynamics but differ in their reward functions. In this setting, the MDP dynamics are valuable knowledge to transfer, and they can be inferred from trajectories of a uniformly random policy. However, trajectories generated by a uniform random policy are of little use for policy improvement, which severely impairs sample efficiency. Instead, we observe that the binary MDP dynamics can be inferred from trajectories of any policy, which removes the need for a uniform random policy. Because the binary MDP dynamics capture the state structure shared across all tasks, we believe they are well suited for transfer. Building on this observation, we introduce a method that infers the binary MDP dynamics online and simultaneously uses them to guide state embedding learning, which is then transferred to new tasks. We keep state embedding learning separate from policy learning. As a result, the learned state embedding is task- and policy-agnostic, which makes it ideal for transfer learning. In addition, to facilitate exploration over the state space, we propose a novel intrinsic reward based on the inferred binary MDP dynamics. Our method can be used out of the box in combination with model-free RL algorithms; we present two instances built on \algo{DQN} and \algo{A2C}. Empirical results from extensive experiments demonstrate the advantage of our proposed method on a variety of transfer learning tasks.
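To make the central claim concrete, the minimal Python sketch below shows one way a binary MDP dynamic could be accumulated from trajectories of an arbitrary behaviour policy and turned into a count-based intrinsic bonus. The class name, the set-of-transitions representation, and the exact bonus form are illustrative assumptions made here for exposition, not the construction used in the paper.

```python
import numpy as np
from collections import defaultdict


class BinaryDynamics:
    """Illustrative sketch: infer a binary MDP dynamic from any policy's trajectories.

    Records which (state, next_state) transitions have been observed and derives
    a simple count-based intrinsic bonus from the inferred structure. States are
    assumed to be hashable (e.g., tuples or discretized observations).
    """

    def __init__(self, beta=0.1):
        self.beta = beta
        self.seen = set()                # observed (s, s') pairs: the binary dynamic
        self.visits = defaultdict(int)   # per-state visit counts for the bonus

    def update(self, s, s_next):
        """Record a transition taken from any behaviour policy's trajectory."""
        self.seen.add((s, s_next))
        self.visits[s_next] += 1

    def intrinsic_reward(self, s_next):
        """Exploration bonus that decays with how often a state has been reached
        (an assumed form, not necessarily the paper's exact reward)."""
        return self.beta / np.sqrt(max(1, self.visits[s_next]))
```

Used with a model-free learner such as \algo{DQN} or \algo{A2C}, one would call update(s, s') on every environment step and add intrinsic_reward(s') to the extrinsic reward before the transition is used for policy learning.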