A promising way to improve the sample efficiency of reinforcement learning is model-based methods, in which much of the exploration and evaluation can happen inside learned models, saving real-world samples. However, when the learned model has non-negligible model error, sequential steps rolled out in the model are hard to evaluate accurately, which limits how much the model can be exploited. This paper proposes to alleviate this issue by replacing multi-step actions with multi-step plans in model-based RL. We employ multi-step plan value estimation, which evaluates the expected discounted return after executing a sequence of action plans at a given state, and update the policy by directly computing the multi-step policy gradient via the plan value estimate. The resulting model-based reinforcement learning algorithm, MPPVE (Model-based Planning Policy Learning with Multi-step Plan Value Estimation), utilizes the learned model better and achieves higher sample efficiency than state-of-the-art model-based RL approaches.
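The k-step plan value described above can be sketched as follows: roll a sequence of actions through the learned dynamics model, accumulate discounted rewards, and bootstrap with a value estimate at the final model state. This is a minimal illustrative sketch, not the authors' implementation; all function names and the toy 1-D dynamics are assumptions.

```python
def plan_value(state, plan, model, reward_fn, value_fn, gamma=0.99):
    """k-step plan value: discounted return of executing the action
    sequence `plan` in the learned model from `state`, plus a
    bootstrapped value of the resulting state.
    (Illustrative sketch; names are assumptions.)"""
    total, discount = 0.0, 1.0
    s = state
    for a in plan:
        total += discount * reward_fn(s, a)
        s = model(s, a)          # one step in the learned dynamics model
        discount *= gamma
    return total + discount * value_fn(s)

# Toy 1-D example (assumed for illustration): the state moves by the
# action, and reward/value penalize distance from the origin.
model = lambda s, a: s + a
reward = lambda s, a: -abs(s)
value = lambda s: -abs(s)

v = plan_value(2.0, [-1.0, -1.0], model, reward, value, gamma=0.9)
# v = -2.0 + 0.9 * (-1.0) + 0.81 * 0.0 = -2.9
```

In MPPVE, gradients of such a plan value with respect to the plan's actions would drive the multi-step policy update, so only the first model step starts from a real state, reducing the impact of compounding model error.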