This paper addresses the problem of reliably and efficiently solving broad classes of long-horizon stochastic path planning problems. Starting with a vanilla RL formulation with a stochastic dynamics simulator and an occupancy matrix of the environment, our approach computes useful options with policies as well as high-level paths that compose the discovered options. Our main contributions are (1) data-driven methods for creating abstract states that serve as endpoints for helpful options, (2) methods for computing option policies using auto-generated option guides in the form of dense pseudo-reward functions, and (3) an overarching algorithm for composing the computed options. We show that this approach yields strong guarantees of executability and solvability: under fairly general conditions, the computed option guides lead to composable option policies and consequently ensure downward refinability. Empirical evaluation on a range of robots, environments, and tasks shows that this approach effectively transfers knowledge across related tasks and that it outperforms existing approaches by a significant margin.