We consider the problem of policy transfer between two Markov Decision Processes (MDPs). We introduce a lemma, built on existing theoretical results in reinforcement learning (RL), that measures the relativity between two arbitrary MDPs, namely the difference between any two cumulative expected returns defined under different policies and environment dynamics. Based on this lemma, we propose two new algorithms, Relative Policy Optimization (RPO) and Relative Transition Optimization (RTO), which offer fast policy transfer and dynamics modeling, respectively. RPO updates the policy using the relative policy gradient, transferring a policy evaluated in one environment so that it maximizes return in another, while RTO updates the parameterized dynamics model (if one exists) using the relative transition gradient to reduce the gap between the dynamics of the two environments. Integrating the two algorithms then yields the complete algorithm, Relative Policy-Transition Optimization (RPTO), in which the policy interacts with the two environments simultaneously, so that data collection from both environments, policy updates, and transition updates are completed in one closed loop, forming a principled learning framework for policy transfer. We demonstrate the effectiveness of RPTO on OpenAI Gym's classic control tasks by creating policy transfer problems through variant dynamics.
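To make the closed-loop structure of RPTO concrete, here is a minimal sketch of how the alternation between data collection and the RPO/RTO updates could be organized. The helper names (collect_rollout, rpo_update, rto_update) and the loop structure are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the RPTO closed loop described above.
# All helper names are illustrative placeholders, not the paper's API.

from typing import Any, Callable, List, Tuple


def rpto_loop(
    env_source: Any,                 # environment the policy is evaluated in
    env_target: Any,                 # environment whose return we want to maximize
    policy: Any,                     # parameterized policy to be transferred
    dynamics_model: Any,             # parameterized dynamics model (if one exists)
    collect_rollout: Callable[[Any, Any], List[Tuple]],  # (env, policy) -> trajectory
    rpo_update: Callable[..., Any],  # relative policy gradient step (RPO)
    rto_update: Callable[..., Any],  # relative transition gradient step (RTO)
    num_iterations: int = 1000,
) -> Any:
    """One possible reading of RPTO: each iteration, the policy interacts
    with both environments, and the collected data drives the RPO and RTO
    updates inside a single closed loop."""
    for _ in range(num_iterations):
        # Interact with the two environments simultaneously.
        data_source = collect_rollout(env_source, policy)
        data_target = collect_rollout(env_target, policy)

        # RPO: move the policy evaluated in the source environment
        # toward maximizing return in the target environment.
        policy = rpo_update(policy, data_source, data_target)

        # RTO: shrink the gap between the modeled dynamics and the
        # target dynamics via the relative transition gradient.
        dynamics_model = rto_update(dynamics_model, data_source, data_target)

    return policy
```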