We present a novel approach for fast and reliable policy selection for navigation in partial maps. Leveraging the recent learning-augmented, model-based Learning over Subgoals Planning (LSP) abstraction for planning, our robot reuses data collected during navigation to evaluate how well alternative policies could have performed, via a procedure we call offline alt-policy replay. Costs from offline alt-policy replay constrain policy selection among the LSP-based policies during deployment, improving convergence speed, cumulative regret, and average navigation cost. With only limited prior knowledge about the nature of unseen environments, our experiments in simulated maze and office-like environments achieve improvements in cumulative regret of at least 67% and as much as 96% over a baseline bandit approach.
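To make the selection mechanism concrete, below is a minimal sketch (not the paper's implementation) of how replayed counterfactual costs might constrain a bandit-style policy selector: instead of updating only the executed policy's cost estimate, every candidate policy is updated with the cost it would have incurred according to offline alt-policy replay. The class name, the lower-confidence-bound rule, and the `replayed_costs` interface are illustrative assumptions, not the paper's API.

```python
import numpy as np


class ReplayConstrainedSelector:
    """Hypothetical sketch: select among candidate LSP-based policies,
    updating *every* policy's cost estimate from replayed (counterfactual)
    costs rather than only the cost of the policy actually executed."""

    def __init__(self, num_policies, exploration_weight=1.0):
        self.counts = np.zeros(num_policies)
        self.mean_costs = np.zeros(num_policies)
        self.c = exploration_weight

    def select(self):
        # Try each policy at least once before trusting the estimates.
        untried = self.counts == 0
        if untried.any():
            return int(np.argmax(untried))
        # Lower-confidence bound on cost (the cost-minimization analog of
        # UCB for rewards): pick the policy whose plausible cost is lowest.
        total = self.counts.sum()
        bonus = self.c * np.sqrt(np.log(total) / self.counts)
        return int(np.argmin(self.mean_costs - bonus))

    def update(self, replayed_costs):
        # replayed_costs[i] is the cost policy i would have incurred on this
        # trial, estimated by replaying the recorded navigation data offline.
        # Updating all arms at once is what accelerates convergence relative
        # to a standard bandit, which learns about one policy per trial.
        for i, cost in enumerate(replayed_costs):
            self.counts[i] += 1
            self.mean_costs[i] += (cost - self.mean_costs[i]) / self.counts[i]
```

Under this sketch, the replay step turns each navigation trial into feedback for all candidate policies at once, which is why cumulative regret can shrink much faster than with a standard bandit that observes only the executed policy's cost.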