Model-based reinforcement learning (RL) achieves higher sample efficiency in practice than model-free RL by learning a dynamics model that generates samples for policy learning. Prior works learn a "global" dynamics model that fits the state-action visitation distribution of all historical policies. In this paper, however, we find that a global dynamics model does not necessarily improve model prediction for the current policy, because the policy in use is constantly evolving. As the policy evolves during training, the state-action visitation distribution shifts. We theoretically analyze how the distribution over historical policies affects model learning and model rollouts. We then propose a novel model-based RL method, named \textit{Policy-adaptation Model-based Actor-Critic (PMAC)}, which learns a policy-adapted dynamics model based on a policy-adaptation mechanism. This mechanism dynamically adjusts the mixture distribution over historical policies so that the learned model continually adapts to the state-action visitation distribution of the evolving policy. Experiments on a range of continuous control environments in MuJoCo show that PMAC achieves state-of-the-art asymptotic performance and almost twice the sample efficiency of prior model-based methods.
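To make the policy-adaptation idea concrete, the following is a minimal sketch of how one might reweight replay-buffer transitions toward the current policy's visitation distribution when sampling minibatches for dynamics-model training. The exponential decay schedule, buffer layout, and function names here are illustrative assumptions for exposition, not the exact mechanism used by PMAC.

\begin{verbatim}
import numpy as np

# Hypothetical illustration: transitions collected by older policies are
# down-weighted when sampling minibatches for dynamics-model training,
# so the model tracks the current policy's visitation distribution.
# The decay rule below is an assumption, not the paper's exact scheme.

rng = np.random.default_rng(0)

# Replay buffer: each transition records which policy produced it.
buffer_states = rng.normal(size=(10_000, 4))    # placeholder states
buffer_actions = rng.normal(size=(10_000, 1))   # placeholder actions
policy_ids = np.repeat(np.arange(10), 1_000)    # 10 historical policies
current_policy_id = policy_ids.max()

def policy_mixture_weights(policy_ids, current_id, decay=0.7):
    """Weight each transition by the age of the policy that collected it.

    Newer policies receive exponentially larger weight, so sampled
    minibatches approximate the current policy's visitation distribution.
    """
    age = current_id - policy_ids
    w = decay ** age
    return w / w.sum()

weights = policy_mixture_weights(policy_ids, current_policy_id)

# Sample a model-training minibatch according to the adapted mixture.
batch_idx = rng.choice(len(policy_ids), size=256, p=weights)
model_batch = (buffer_states[batch_idx], buffer_actions[batch_idx])
print("fraction of batch from the current policy:",
      np.mean(policy_ids[batch_idx] == current_policy_id))
\end{verbatim}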