Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

Offline reinforcement learning (RL), which aims to learn an optimal policy from a previously collected static dataset, is an important paradigm of RL. Standard RL methods often perform poorly in this regime because of function approximation errors on out-of-distribution actions. While a variety of regularization methods have been proposed to mitigate this issue, they are often constrained by policy classes with limited expressiveness, which can lead to highly suboptimal solutions. In this paper, we propose representing the policy as a diffusion model, a recent class of highly expressive deep generative models. We introduce Diffusion Q-learning (Diffusion-QL), which uses a conditional diffusion model to represent the policy. In our approach, we learn an action-value function and add a term that maximizes action-values to the training loss of the conditional diffusion model, which results in a loss that seeks optimal actions near the behavior policy. We show that both the expressiveness of the diffusion model-based policy and the coupling of behavior cloning and policy improvement under the diffusion model contribute to the outstanding performance of Diffusion-QL. We illustrate the superiority of our method over prior work in a simple 2D bandit example with a multimodal behavior policy. We then show that our method achieves state-of-the-art performance on the majority of the D4RL benchmark tasks.
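
As a rough illustration of the loss described above, the sketch below combines a denoising behavior-cloning term with a term that rewards high action-values. The function names `noise_action` and `policy_loss`, the weight `eta`, and all stand-in numbers are assumptions made for illustration and are not taken from the abstract. A real implementation would use neural networks for the denoiser and the action-value function, trained with automatic differentiation.

```python
import math
import random


def noise_action(a0, alpha_bar_t, eps):
    """Standard DDPM forward step, applied per action dimension."""
    return [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * e
            for a, e in zip(a0, eps)]


def policy_loss(eps_pred, eps_true, q_values, eta=1.0):
    """Behavior-cloning term plus a weighted action-value maximization term."""
    # Behavior cloning: mean squared error between predicted and true noise.
    bc = sum((p - t) ** 2 for p, t in zip(eps_pred, eps_true)) / len(eps_true)
    # Policy improvement: maximizing Q is the same as minimizing its negative mean.
    improve = -sum(q_values) / len(q_values)
    return bc + eta * improve


random.seed(0)
a0 = [0.5, -0.2]                            # a dataset action (2-D, made up)
eps = [random.gauss(0.0, 1.0) for _ in a0]  # Gaussian noise added to the action
a_t = noise_action(a0, 0.5, eps)            # noised action the denoiser would see
eps_pred = [0.0, 0.0]                       # stand-in for the denoiser's output
q_values = [1.0, 2.0]                       # stand-in for Q(s, a) on policy samples
loss = policy_loss(eps_pred, eps, q_values, eta=0.1)
```

With `eta = 0` the loss reduces to pure behavior cloning. Larger values of `eta` push sampled actions toward higher estimated value, while the cloning term keeps them close to actions seen in the dataset.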
