Effective offline RL methods require properly handling out-of-distribution actions. Implicit Q-learning (IQL) addresses this by training a Q-function using only dataset actions through a modified Bellman backup. However, it is unclear which policy actually attains the values represented by this implicitly trained Q-function. In this paper, we reinterpret IQL as an actor-critic method by generalizing the critic objective and connecting it to a behavior-regularized implicit actor. This generalization shows how the induced actor balances reward maximization and divergence from the behavior policy, with the specific loss choice determining the nature of this tradeoff. Notably, this actor can exhibit complex and multimodal characteristics, suggesting issues with the conditional Gaussian actor fit with advantage weighted regression (AWR) used in prior methods. Instead, we propose using samples from a diffusion-parameterized behavior policy, together with weights computed from the critic, to importance sample our intended policy. We introduce Implicit Diffusion Q-learning (IDQL), which combines our general IQL critic with this policy extraction method. IDQL maintains the ease of implementation of IQL while outperforming prior offline RL methods and demonstrating robustness to hyperparameters. Code is available at https://github.com/philippe-eecs/IDQL.
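To make the policy-extraction step concrete, the following is a minimal sketch, not the paper's exact implementation: candidate actions are drawn from a diffusion behavior model, scored by the learned critic, and one action is resampled according to critic-derived importance weights. The names `behavior_model.sample`, `critic`, and the softmax weighting are illustrative assumptions.

```python
import numpy as np

def extract_action(state, behavior_model, critic, num_samples=64, temperature=1.0):
    """Hypothetical sketch: pick an action by resampling critic-weighted behavior samples."""
    # Draw candidate actions a_i ~ mu(a | s) from the diffusion behavior policy (assumed API).
    actions = behavior_model.sample(state, num_samples)   # shape: (num_samples, action_dim)

    # Score each candidate with the learned critic, e.g. Q(s, a) or an advantage estimate.
    scores = critic(state, actions)                       # shape: (num_samples,)

    # Convert scores to importance weights; a softmax is one simple, illustrative choice.
    weights = np.exp((scores - scores.max()) / temperature)
    weights /= weights.sum()

    # Resample a single action in proportion to its weight.
    idx = np.random.choice(num_samples, p=weights)
    return actions[idx]
```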