We propose a new approach for solving planning problems with a hierarchical structure by fusing reinforcement learning and MPC planning. Our formulation tightly and elegantly couples the two paradigms: reinforcement learning actions inform the MPPI sampler, and MPPI samples are adaptively aggregated to inform the value estimation. The resulting adaptive process devotes additional MPPI exploration to regions where value estimates are uncertain, improving training robustness and the quality of the resulting policies. The outcome is a robust planning approach that handles complex planning problems and adapts easily to different applications, as demonstrated across several domains, including race driving, a modified Acrobot, and Lunar Lander with added obstacles. Our results in these domains show better data efficiency and overall performance in terms of both rewards and task success, with up to a 72% increase in success rate compared to existing approaches, as well as accelerated convergence (2.1×) compared to non-adaptive sampling.
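To make the coupling described above concrete, the following minimal sketch illustrates the general idea of warm-starting an MPPI sampler with an RL policy's actions and scaling the sampling budget with value-estimate uncertainty. All names (`rl_policy`, `value_ensemble`, `dynamics`, `reward_fn`) and the specific uncertainty heuristic are hypothetical stand-ins assumed for illustration, not the paper's actual interface or algorithm.

```python
import numpy as np

def mppi_step(state, rl_policy, value_ensemble, dynamics, reward_fn,
              horizon=20, base_samples=64, max_samples=256,
              noise_std=0.3, temperature=1.0):
    # Warm-start the nominal action sequence by rolling the RL policy forward.
    nominal, s = [], state
    for _ in range(horizon):
        a = rl_policy(s)
        nominal.append(a)
        s = dynamics(s, a)
    nominal = np.stack(nominal)  # shape (horizon, action_dim)

    # Adaptive budget: allocate more MPPI rollouts where an ensemble of
    # value estimates disagrees (i.e., where the value is uncertain).
    values = np.array([v(state) for v in value_ensemble])
    uncertainty = values.std()
    n_samples = int(np.clip(base_samples * (1.0 + uncertainty),
                            base_samples, max_samples))

    # Sample perturbed action sequences around the RL-informed nominal plan.
    noise = noise_std * np.random.randn(n_samples, *nominal.shape)
    candidates = nominal[None] + noise

    # Roll out each candidate and accumulate its return.
    returns = np.zeros(n_samples)
    for i in range(n_samples):
        s = state
        for t in range(horizon):
            r = reward_fn(s, candidates[i, t])
            s = dynamics(s, candidates[i, t])
            returns[i] += r

    # Standard MPPI weighting: softmax over returns, then a weighted
    # average of the sampled sequences; the weighted return can also
    # serve as an aggregated value target for training.
    weights = np.exp((returns - returns.max()) / temperature)
    weights /= weights.sum()
    plan = (weights[:, None, None] * candidates).sum(axis=0)
    return plan[0], float(returns @ weights)
```

This is only a schematic instance of the sampling-and-aggregation pattern under the stated assumptions; the paper's method may differ in how the RL prior enters the sampling distribution and how samples feed back into value estimation.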