Hierarchical Reinforcement Learning (HRL) agents often struggle with long-horizon visual planning due to their reliance on error-prone continuous distance metrics. We propose Discrete Hierarchical Planning (DHP), a method that replaces continuous distance estimates with discrete reachability checks to evaluate subgoal feasibility. DHP recursively constructs tree-structured plans by decomposing long-term goals into sequences of simpler subtasks, using a novel advantage estimation strategy that inherently rewards shorter plans and generalizes beyond the depths seen during training. In addition, to address the data-efficiency challenge, we introduce an exploration strategy that generates targeted training examples for the planning modules without requiring expert data. Experiments in 25-room navigation environments demonstrate a 100% success rate (vs. 90% for the baseline). We also present an offline variant that achieves state-of-the-art results on OGBench benchmarks, with up to 71% absolute gains on giant HumanoidMaze tasks, demonstrating that our core contributions are architecture-agnostic. The method also generalizes to momentum-based control tasks and requires only log N steps for replanning. Theoretical analysis and ablations validate our design choices.
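The recursive decomposition described above can be sketched in a toy form. This is a minimal illustration, not the paper's implementation: it assumes integer states on a line, a hypothetical `reachable` predicate standing in for the learned discrete reachability check, and a midpoint heuristic standing in for the learned subgoal proposal. The point is the structure: subtasks are split until each leaf is directly reachable, so the recursion depth grows as log N.

```python
def reachable(s, g):
    """Discrete reachability check (hypothetical stand-in for the
    learned binary classifier): true if g is within one step of s."""
    return abs(g - s) <= 1

def plan(s, g, depth=10):
    """Recursively decompose the task (s -> g) into a tree of subtasks,
    returning the ordered sequence of subgoals at the leaves.

    `depth` caps the recursion, mirroring a maximum planning depth;
    the midpoint subgoal is an illustrative heuristic, not the
    learned proposal module from the paper.
    """
    if reachable(s, g) or depth == 0:
        return [g]
    mid = (s + g) // 2  # hypothetical subgoal proposal
    return plan(s, mid, depth - 1) + plan(mid, g, depth - 1)

# A task spanning 16 states is solved by a tree of depth log2(16) = 4,
# whose leaves form the step-by-step subgoal sequence.
print(plan(0, 16))
```

Because only one root-to-leaf branch of the tree needs to be recomputed when the agent deviates, replanning touches O(log N) nodes rather than the full plan.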