Planning with pretrained diffusion models has emerged as a promising approach for solving test-time guided control problems. Standard gradient guidance typically performs optimally under convex, differentiable reward landscapes. However, it shows substantially reduced effectiveness in real-world scenarios with non-convex objectives, non-differentiable constraints, and multi-reward structures. Furthermore, recent supervised planning approaches require task-specific training or value estimators, which limits test-time flexibility and zero-shot generalization. We propose a Tree-guided Diffusion Planner (TDP), a zero-shot test-time planning framework that balances exploration and exploitation through structured trajectory generation. We frame test-time planning as a tree search problem using a bi-level sampling process: (1) diverse parent trajectories are produced via training-free particle guidance to encourage broad exploration, and (2) sub-trajectories are refined through fast conditional denoising guided by task objectives. TDP addresses the limitations of gradient guidance by exploring diverse trajectory regions and harnessing gradient information across this expanded solution space using only pretrained models and test-time reward signals. We evaluate TDP on three diverse tasks: maze gold-picking, robot arm block manipulation, and AntMaze multi-goal exploration. TDP consistently outperforms state-of-the-art approaches on all tasks. The project page can be found at: https://tree-diffusion-planner.github.io.
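The bi-level sampling process described above can be illustrated with a toy sketch. Everything here is an assumption for illustration: the denoiser is a trivial stub standing in for a pretrained diffusion model, the pairwise-repulsion form of particle guidance is a simplified variant, and the goal-distance reward with a finite-difference gradient stands in for an arbitrary test-time reward signal. Function names (`tdp_plan`, `particle_guidance`, etc.) are hypothetical, not the authors' API.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x, t):
    """Stub for one step of a pretrained diffusion denoiser (pulls toward 0)."""
    return x - 0.1 * t * x

def particle_guidance(particles, strength=0.05):
    """Training-free pairwise repulsion between trajectories (simplified
    particle guidance) to keep the parent set diverse during denoising."""
    out = []
    for i, p in enumerate(particles):
        repulsion = np.zeros_like(p)
        for j, q in enumerate(particles):
            if i != j:
                diff = p - q
                repulsion += diff / (np.linalg.norm(diff) ** 2 + 1e-6)
        out.append(p + strength * repulsion)
    return out

def reward(traj, goal):
    """Test-time reward: negative distance of the final state to a goal."""
    return -np.linalg.norm(traj[-1] - goal)

def reward_grad(traj, goal, eps=1e-4):
    """Finite-difference reward gradient; usable even when the reward is
    not analytically differentiable."""
    g = np.zeros_like(traj)
    base = reward(traj, goal)
    for idx in np.ndindex(traj.shape):
        bumped = traj.copy()
        bumped[idx] += eps
        g[idx] = (reward(bumped, goal) - base) / eps
    return g

def tdp_plan(goal, n_parents=4, n_children=3, horizon=8, dim=2, steps=10):
    # Level 1: denoise diverse parent trajectories under particle guidance.
    parents = [rng.normal(size=(horizon, dim)) for _ in range(n_parents)]
    for t in range(steps, 0, -1):
        parents = [denoise_step(p, t / steps) for p in parents]
        parents = particle_guidance(parents)
    # Level 2: branch child sub-trajectories from each parent and refine
    # them with a few reward-gradient ascent steps (fast conditional guidance).
    best, best_r = None, -np.inf
    for p in parents:
        for _ in range(n_children):
            child = p + 0.05 * rng.normal(size=p.shape)
            for _ in range(5):
                child = child + 0.1 * reward_grad(child, goal)
            r = reward(child, goal)
            if r > best_r:
                best, best_r = child, r
    return best, best_r

traj, r = tdp_plan(goal=np.array([1.0, 1.0]))
print(traj.shape, float(r))
```

The tree structure is implicit in the two loops: diverse parents span distant trajectory regions (exploration), while gradient refinement of their children exploits local reward information within each region.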