Traditional model-based reinforcement learning (RL) methods generate forward rollout traces with a learned dynamics model to reduce interactions with the real environment. A recent line of model-based RL additionally learns a backward model, which specifies the conditional probability of the previous state given the previous action and the current state, to generate backward rollout trajectories as well. However, in this type of model-based method, the samples derived from backward rollouts and those from forward rollouts are simply aggregated and used to optimize the policy via a model-free RL algorithm, which may reduce both sample efficiency and the convergence rate. Such an approach ignores the fact that backward rollout traces are usually generated starting from high-value states and are therefore more instructive for the agent when improving its behavior. In this paper, we propose the backward imitation and forward reinforcement learning (BIFRL) framework, in which the agent treats backward rollout traces as expert demonstrations for imitating successful behaviors and then collects forward rollout transitions for policy reinforcement. Consequently, BIFRL enables the agent to both reach and explore from high-value states more efficiently, and further reduces the number of real interactions, making it potentially more suitable for real-robot learning. Moreover, a value-regularized generative adversarial network is introduced to augment the valuable states that the agent encounters only infrequently. Theoretically, we provide the condition under which BIFRL is superior to the baseline methods. Experimentally, we demonstrate that BIFRL achieves better sample efficiency and competitive asymptotic performance on various MuJoCo locomotion tasks compared with state-of-the-art model-based methods.
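
To make the interplay of the two rollout directions concrete, the following is a minimal Python sketch of one BIFRL-style iteration: backward rollouts from high-value states are treated as demonstrations for an imitation update, and forward model rollouts feed a model-free reinforcement update. This is an illustration only; every interface here (value_fn, generate_states, backward_rollout, forward_rollout, imitation_update, rl_update, the value threshold, and the rollout horizon) is a hypothetical placeholder and not the paper's actual implementation.

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Transition = Tuple[State, Sequence[float], float, State]  # (s, a, r, s')


def bifrl_iteration(
    buffer_states: List[State],
    value_fn: Callable[[State], float],                    # learned value estimate
    generate_states: Callable[[int], List[State]],         # value-regularized GAN sampler
    backward_rollout: Callable[[State, int], List[Transition]],
    forward_rollout: Callable[[State, int], List[Transition]],
    imitation_update: Callable[[List[Transition]], None],
    rl_update: Callable[[List[Transition]], None],
    value_threshold: float = 0.9,
    horizon: int = 5,
) -> None:
    """One illustrative BIFRL-style iteration over hypothetical interfaces."""
    # 1. Select high-value starting states from the buffer and augment them
    #    with states produced by the (assumed) value-regularized generator.
    starts = [s for s in buffer_states if value_fn(s) > value_threshold]
    starts += generate_states(len(starts))

    # 2. Backward rollouts from high-value states serve as expert demonstrations,
    #    so the policy imitates behaviors that lead to those states.
    demos: List[Transition] = []
    for s in starts:
        demos += backward_rollout(s, horizon)
    imitation_update(demos)

    # 3. Forward rollouts with the learned dynamics model supply transitions
    #    for a standard model-free policy reinforcement update.
    model_data: List[Transition] = []
    for s in starts:
        model_data += forward_rollout(s, horizon)
    rl_update(model_data)
```

In this sketch the key design choice of BIFRL is visible in the asymmetry of steps 2 and 3: backward traces are consumed by an imitation objective rather than being mixed into the same buffer as forward traces, which is what distinguishes it from simply aggregating both rollout directions.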