Offline reinforcement learning (RL), which aims to learn an optimal policy from a previously collected static dataset, is an important paradigm of RL. Standard RL methods often perform poorly at this task due to function approximation errors on out-of-distribution actions. While a variety of regularization methods have been proposed to mitigate this issue, they are often constrained to policy classes with limited expressiveness and can result in substantially suboptimal solutions. In this paper, we propose Diffusion-QL, which utilizes a conditional diffusion model as a highly expressive policy class for behavior cloning and policy regularization. In our approach, we learn an action-value function and add a term that maximizes action-values to the training loss of the conditional diffusion model, yielding a loss that seeks optimal actions near the behavior policy. We show that both the expressiveness of the diffusion model-based policy and the coupling of behavior cloning and policy improvement under the diffusion model contribute to the outstanding performance of Diffusion-QL. We illustrate our method and prior work on a simple 2D bandit example with a multimodal behavior policy. We then show that our method achieves state-of-the-art performance on the majority of the D4RL benchmark tasks for offline RL.
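The combined objective described above (a diffusion behavior-cloning loss plus a Q-maximization term) can be sketched as follows. This is a toy illustration, not the paper's implementation: the linear denoiser, the quadratic `q_fn`, the simplified one-step noise schedule, the dimensions, and the weight `alpha` are all hypothetical placeholders standing in for the learned networks and the full reverse diffusion chain.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM = 2, 2  # toy dimensions (illustrative only)

# Toy linear stand-in for the conditional diffusion denoiser's weights.
W = rng.normal(size=(STATE_DIM + ACTION_DIM + 1, ACTION_DIM)) * 0.1

def predict_noise(state, noisy_action, t):
    """Predict the noise added to an action, conditioned on state and timestep."""
    feats = np.concatenate([state, noisy_action, [t]])
    return feats @ W

def q_fn(state, action):
    """Toy stand-in for the learned action-value function Q(s, a)."""
    return -np.sum((action - state) ** 2)

def diffusion_ql_loss(states, actions, alpha=1.0):
    n = len(states)
    # 1) Behavior-cloning term: denoising MSE on dataset actions,
    #    with a simplified continuous-time noise schedule.
    t = rng.uniform(0.1, 0.9, size=n)
    noise = rng.normal(size=actions.shape)
    noisy = np.sqrt(1.0 - t)[:, None] * actions + np.sqrt(t)[:, None] * noise
    pred = np.array([predict_noise(s, a, ti) for s, a, ti in zip(states, noisy, t)])
    bc_loss = np.mean((pred - noise) ** 2)
    # 2) Policy-improvement term: denoise one step, then maximize Q of the
    #    resulting actions (minimize -Q), keeping them near the behavior policy.
    denoised = noisy - np.sqrt(t)[:, None] * pred
    q_loss = -np.mean([q_fn(s, a) for s, a in zip(states, denoised)])
    return bc_loss + alpha * q_loss

states = rng.normal(size=(8, STATE_DIM))
actions = rng.normal(size=(8, ACTION_DIM))
loss = diffusion_ql_loss(states, actions, alpha=0.5)
```

The key design point the sketch reflects is that the two terms share one loss: the BC term keeps sampled actions in-distribution while the `-Q` term steers them toward high values, rather than training the policy and regularizer separately.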