Solving multi-goal reinforcement learning (RL) problems with sparse rewards is generally challenging. Existing approaches relabel goals on collected experiences to alleviate the issues arising from sparse rewards. However, these methods are still limited in efficiency and cannot make full use of experiences. In this paper, we propose Model-based Hindsight Experience Replay (MHER), which exploits experiences more efficiently by leveraging environmental dynamics to generate virtual achieved goals. Replacing original goals with virtual goals generated from interaction with a trained dynamics model yields a novel relabeling method, model-based relabeling (MBR). Based on MBR, MHER performs both reinforcement learning and supervised learning for efficient policy improvement. Theoretically, we also prove that the supervised component of MHER, i.e., goal-conditioned supervised learning with MBR data, optimizes a lower bound on the multi-goal RL objective. Experimental results in several point-based tasks and simulated robotics environments show that MHER achieves significantly higher sample efficiency than previous model-free and model-based multi-goal methods.