We present Planning as Descent (PaD), a framework for offline goal-conditioned reinforcement learning that grounds trajectory synthesis in verification. Instead of learning a policy or an explicit planner, PaD learns a goal-conditioned energy function over entire latent trajectories, assigning low energy to feasible, goal-consistent futures. Planning is realized as gradient-based refinement in this energy landscape, using identical computation at training and inference to reduce the train-test mismatch common in decoupled modeling pipelines. PaD is trained via self-supervised hindsight goal relabeling, which shapes the energy landscape around the planning dynamics. At inference, multiple trajectory candidates are refined under different temporal hypotheses, and low-energy plans that balance feasibility and efficiency are selected. We evaluate PaD on OGBench cube manipulation tasks. When trained on narrow expert demonstrations, PaD achieves a state-of-the-art 95\% success rate, substantially outperforming prior methods, which peak at 68\%. Remarkably, training on noisy, suboptimal data further improves success and plan efficiency, highlighting the benefits of verification-driven planning. Our results suggest that learning to evaluate and refine trajectories provides a robust alternative to direct policy learning for offline, reward-free planning.
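To make the inference procedure concrete, the following is a minimal sketch, in JAX, of planning as gradient descent on a trajectory-level energy: candidates are initialized under several temporal hypotheses (plan lengths), each is refined by descending the energy, and the lowest-energy refined plan is returned. All names (`energy_fn`, `refine`, `plan`, the toy quadratic energy, the horizon penalty, and the step sizes) are illustrative assumptions, not the paper's released implementation or learned model.

```python
import jax
import jax.numpy as jnp


def energy_fn(params, traj, goal):
    # Stand-in for the learned goal-conditioned energy over a latent trajectory:
    # low energy should correspond to a feasible, goal-consistent plan.
    # Toy form (assumed): terminal distance to goal + smoothness of the latents.
    terminal = jnp.sum((traj[-1] - goal) ** 2)
    smooth = jnp.sum((traj[1:] - traj[:-1]) ** 2)
    return params["w_goal"] * terminal + params["w_smooth"] * smooth


@jax.jit
def refine(params, traj, goal, step_size=0.1, n_steps=50):
    # Gradient-based refinement: descend the energy landscape in trajectory space.
    grad_fn = jax.grad(energy_fn, argnums=1)

    def body(_, t):
        return t - step_size * grad_fn(params, t, goal)

    return jax.lax.fori_loop(0, n_steps, body, traj)


def plan(params, goal, latent_dim, horizons, key):
    # Refine one candidate per temporal hypothesis (plan length), then select the
    # candidate whose refined energy, plus a small horizon penalty encouraging
    # shorter plans (assumed form), is lowest.
    best, best_score = None, jnp.inf
    for h in horizons:
        key, sub = jax.random.split(key)
        init = 0.01 * jax.random.normal(sub, (h, latent_dim))
        refined = refine(params, init, goal)
        score = energy_fn(params, refined, goal) + 0.01 * h
        if score < best_score:
            best, best_score = refined, score
    return best


params = {"w_goal": 1.0, "w_smooth": 0.1}
goal = jnp.ones(4)
best_plan = plan(params, goal, latent_dim=4, horizons=(8, 16, 32),
                 key=jax.random.PRNGKey(0))
print(best_plan.shape)  # (chosen horizon, latent_dim)
```

Because the same `refine` routine would be run inside the training loop (on hindsight-relabeled goals) and at test time, the train-test computation stays matched, which is the property the abstract emphasizes.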