Offline reinforcement learning (RL) aims to learn an effective policy from offline datasets without active interaction with the environment. The major challenge of offline RL is the distribution shift that arises when out-of-distribution actions are queried, which biases the policy improvement direction with extrapolation errors. Most existing methods address this problem by penalizing the policy for deviating from the behavior policy during policy improvement, or by making conservative value-function updates during policy evaluation. In this work, we propose a novel MISA framework that approaches offline RL from the perspective of Mutual Information between States and Actions in the dataset, directly constraining the policy improvement direction. Intuitively, mutual information measures the mutual dependence of actions and states, which reflects how a behavior agent reacts to certain environment states during data collection. To effectively utilize this information to facilitate policy learning, MISA constructs lower bounds of mutual information parameterized by the policy and Q-values. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset. In this way, we constrain the policy improvement direction to lie on the data manifold. The resulting algorithm simultaneously augments policy evaluation and policy improvement by adding a mutual information regularization. MISA is a general offline RL framework that unifies conservative Q-learning (CQL) and behavior regularization methods (e.g., TD3+BC) as special cases. Our experiments show that MISA performs significantly better than existing methods and achieves new state-of-the-art results on various tasks of the D4RL benchmark.
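To make the formulation concrete, the following is a sketch of one standard lower bound of this kind (the Donsker-Varadhan bound), with the Q-function playing the role of the critic; the exact parameterization used by MISA, the action marginal $\mu$, and the trade-off weight $\lambda$ below are illustrative assumptions rather than the paper's definitions:
$$
I(S;A) \;\ge\; \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[Q(s,a)\big] \;-\; \log \mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu}\big[e^{Q(s,a)}\big],
$$
where the first expectation is taken over state-action pairs as stored in the offline dataset $\mathcal{D}$ and the second over the product of marginals. An estimate of such a bound can then be added as a regularizer, e.g., maximizing the usual actor and critic objectives plus $\lambda\,\hat{I}(S;A)$ during both policy evaluation and policy improvement.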