In goal-oriented reinforcement learning, relabeling the goals in past experience to give the agent hindsight is a major remedy for the reward-sparsity problem. In this paper, to enhance the diversity of relabeled goals, we develop FGI (Foresight Goal Inference), a new relabeling strategy that relabels goals by looking into the future with a learned dynamics model. In addition, to improve sample efficiency, we propose using the dynamics model to generate simulated trajectories for policy training. Integrating these two improvements, we introduce the MapGo framework (Model-Assisted Policy Optimization for Goal-oriented tasks). In our experiments, we first demonstrate the effectiveness of the FGI strategy compared with hindsight relabeling, and then show that the MapGo framework achieves higher sample efficiency than model-free baselines on a set of complicated tasks.
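To make the foresight idea concrete, below is a minimal Python sketch of goal relabeling with a learned dynamics model. The callables `dynamics_model(state, action)`, `policy(state, goal)`, and `achieved_goal(state)` are hypothetical stand-ins, and the sparse 0/-1 reward is a common goal-conditioned RL convention; this is an illustration of the strategy under those assumptions, not the paper's implementation.

```python
import numpy as np

def foresight_relabel(transition, dynamics_model, policy, achieved_goal,
                      horizon=5):
    """Relabel a transition's goal by rolling the learned model forward.

    Rather than picking a goal already achieved later in the stored
    trajectory (hindsight), we simulate `horizon` steps into the future
    with the learned dynamics model and take the goal achieved by the
    imagined final state, which can yield more diverse relabeled goals.
    """
    state, action, goal, next_state = transition

    # Roll the policy out through the learned model, starting from the
    # transition's successor state and conditioning on the original goal.
    sim_state = next_state
    for _ in range(horizon):
        sim_action = policy(sim_state, goal)
        sim_state = dynamics_model(sim_state, sim_action)

    # The goal attained by the imagined future state becomes the new goal.
    new_goal = achieved_goal(sim_state)

    # Sparse goal-reaching reward: 0 on success, -1 otherwise (an assumed
    # convention; the paper's reward definition may differ).
    success = np.allclose(achieved_goal(next_state), new_goal, atol=1e-3)
    reward = 0.0 if success else -1.0

    return state, action, new_goal, reward, next_state
```

Relabeled transitions produced this way can be stored in the replay buffer alongside real ones; the same learned model can also generate full simulated trajectories for policy training, the second ingredient of MapGo.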