Goal-Conditioned Imitation Learning using Score-based Diffusion Policies

We propose a new policy representation based on score-based diffusion models (SDMs). We apply it in the domain of Goal-Conditioned Imitation Learning (GCIL) to learn general-purpose goal-specified policies from large uncurated datasets without rewards. Our new goal-conditioned policy architecture "$\textbf{BE}$havior generation with $\textbf{S}$c$\textbf{O}$re-based Diffusion Policies" (BESO) uses a generative, score-based diffusion model as its policy. BESO decouples the learning of the score model from the inference sampling process and hence allows fast sampling strategies that generate goal-specified behavior in just 3 denoising steps, compared to the 30+ steps required by other diffusion-based policies. Furthermore, BESO is highly expressive and can effectively capture the multi-modality present in the solution space of the play data. Unlike previous methods such as Latent Plans or C-Bet, BESO does not rely on complex hierarchical policies or additional clustering for effective goal-conditioned behavior learning. Finally, we show how BESO can even be used to learn a goal-independent policy from play data using classifier-free guidance. To the best of our knowledge, this is the first work that a) represents a behavior policy based on such a decoupled SDM, b) learns an SDM-based policy in the domain of GCIL, and c) provides a way to simultaneously learn a goal-dependent and a goal-independent policy from play data. We evaluate BESO through detailed simulation and show that it consistently outperforms several state-of-the-art goal-conditioned imitation learning methods on challenging benchmarks. We additionally provide extensive ablation studies and experiments to demonstrate the effectiveness of our method for goal-conditioned behavior generation.
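To make the idea of decoupled sampling concrete, the following is a minimal sketch, not the authors' implementation. It assumes a denoiser-style parameterization of the score model and a plain Euler sampler on a geometric noise schedule; the abstract does not state which sampler or schedule BESO uses. The names `toy_denoiser`, `sample_action`, `DATA_STD`, `guidance_weight`, `sigma_max`, and `sigma_min` are hypothetical. A closed-form denoiser for a toy Gaussian action distribution stands in for a trained network, so the example runs without any training. The number of denoising steps is chosen only at inference time, and the guidance weight blends a goal-conditioned and a goal-independent prediction in the style of classifier-free guidance.

```python
import numpy as np

DATA_STD = 0.1  # assumed spread of the toy action distribution


def toy_denoiser(action, sigma, goal=None):
    """Closed-form denoiser for toy data N(mean, DATA_STD^2 I).

    Stands in for a trained network (assumption for illustration).
    goal=None mimics the goal-dropped branch; its toy data mean is zero.
    """
    mean = np.zeros_like(action) if goal is None else goal
    var, noise_var = DATA_STD**2, sigma**2
    return (var * action + noise_var * mean) / (var + noise_var)


def sample_action(goal, n_steps=3, guidance_weight=1.0,
                  sigma_max=1.0, sigma_min=0.01, seed=0):
    """Draw one action with n_steps Euler denoising steps."""
    rng = np.random.default_rng(seed)
    # Noise levels from sigma_max down to sigma_min, then a final 0.
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, n_steps), 0.0)
    action = sigma_max * rng.standard_normal(goal.shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        cond = toy_denoiser(action, sigma, goal)
        uncond = toy_denoiser(action, sigma, None)
        # Classifier-free guidance: weight 0 ignores the goal,
        # weight 1 uses the goal-conditioned prediction only.
        denoised = uncond + guidance_weight * (cond - uncond)
        slope = (action - denoised) / sigma
        action = action + slope * (sigma_next - sigma)
    return action


goal = np.array([0.5, -0.3])
goal_action = sample_action(goal, n_steps=3, guidance_weight=1.0)
free_action = sample_action(goal, n_steps=3, guidance_weight=0.0)
```

In this toy setup, `goal_action` lands near `goal` and `free_action` lands near the origin after three denoiser evaluations per branch. Changing `n_steps` needs no retraining, which is the practical benefit of keeping the sampler separate from the learned model.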

