We propose a new policy representation based on score-based diffusion models (SDMs). We apply this policy representation in the domain of Goal-Conditioned Imitation Learning (GCIL) to learn general-purpose, goal-specified policies from large, uncurated datasets without rewards. Our goal-conditioned policy architecture, "$\textbf{BE}$havior generation with $\textbf{S}$c$\textbf{O}$re-based Diffusion Policies" (BESO), leverages a generative, score-based diffusion model as its policy. BESO decouples the learning of the score model from the inference sampling process and thus allows fast sampling strategies that generate goal-specified behavior in just 3 denoising steps, compared to the 30+ steps required by other diffusion-based policies. Furthermore, BESO is highly expressive and effectively captures the multi-modality present in the solution space of the play data. Unlike previous methods such as Latent Plans or C-Bet, BESO does not rely on complex hierarchical policies or additional clustering for effective goal-conditioned behavior learning. Finally, we show how BESO can also be used to learn a goal-independent policy from play data using classifier-free guidance. To the best of our knowledge, this is the first work that a) represents a behavior policy with such a decoupled SDM, b) learns an SDM-based policy in the domain of GCIL, and c) provides a way to simultaneously learn a goal-dependent and a goal-independent policy from play data. We evaluate BESO in detailed simulation experiments and show that it consistently outperforms several state-of-the-art goal-conditioned imitation learning methods on challenging benchmarks. We additionally provide extensive ablation studies and experiments to demonstrate the effectiveness of our method for goal-conditioned behavior generation.
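To illustrate the decoupled, few-step inference and the classifier-free guidance described above, the sketch below shows how a goal-conditioned denoising network could be sampled with a handful of Euler denoising steps, blending conditional and unconditional predictions. This is a minimal illustration under stated assumptions, not the BESO implementation: the toy `DenoiserMLP`, the log-spaced noise schedule, the zero-goal token used for the unconditional branch, and the guidance weight `w` are all hypothetical choices for the example.

```python
# Minimal sketch (not the authors' code): classifier-free guided sampling of an
# action from a goal-conditioned denoising network in a few denoising steps.
import math
import torch
import torch.nn as nn

class DenoiserMLP(nn.Module):
    """Toy denoiser D(a_noisy, sigma, obs, goal) -> denoised action (assumed architecture)."""
    def __init__(self, act_dim=7, obs_dim=32, goal_dim=32, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + 1 + obs_dim + goal_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, a, sigma, obs, goal):
        sigma = sigma.expand(a.shape[0], 1)
        return self.net(torch.cat([a, sigma, obs, goal], dim=-1))

@torch.no_grad()
def sample_action(denoiser, obs, goal, n_steps=3, sigma_max=1.0, sigma_min=1e-2, w=2.0):
    """Few-step Euler sampling with classifier-free guidance.

    The goal-independent prediction uses a zeroed-out goal as the "no goal" token
    (an assumption); `w` blends the conditional and unconditional outputs.
    """
    act_dim = denoiser.net[-1].out_features
    # Log-spaced noise levels from sigma_max down to sigma_min (illustrative schedule).
    sigmas = torch.exp(torch.linspace(math.log(sigma_max), math.log(sigma_min), n_steps + 1))
    a = torch.randn(obs.shape[0], act_dim) * sigmas[0]   # start from pure noise
    no_goal = torch.zeros_like(goal)
    for i in range(n_steps):
        sigma = sigmas[i].view(1, 1)
        d_cond = denoiser(a, sigma, obs, goal)
        d_uncond = denoiser(a, sigma, obs, no_goal)
        denoised = d_uncond + w * (d_cond - d_uncond)     # guided denoised action
        d = (a - denoised) / sigmas[i]                    # score-based update direction
        a = a + d * (sigmas[i + 1] - sigmas[i])           # Euler step toward lower noise
    return a

# Usage: one guided action for a single state/goal pair.
denoiser = DenoiserMLP()
obs, goal = torch.randn(1, 32), torch.randn(1, 32)
action = sample_action(denoiser, obs, goal, n_steps=3)
```

Because training only fits the denoising (score) network, the number of inference steps and the sampler itself can be chosen after training, which is what makes a 3-step variant like the one sketched here possible.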