Most prior approaches to offline reinforcement learning (RL) rely on \textit{behavior regularization}, typically augmenting existing off-policy actor-critic algorithms with a penalty that measures the divergence between the learned policy and the offline data. However, these approaches lack a guaranteed performance improvement over the behavior policy. In this work, starting from the performance difference between the learned policy and the behavior policy, we derive a new policy learning objective that can be used in the offline setting and corresponds to the advantage function of the behavior policy multiplied by a state-marginal density ratio. We propose a practical way to compute this density ratio and demonstrate its equivalence to a state-dependent behavior regularization. Unlike the state-independent regularization used in prior approaches, this \textit{soft} regularization allows the policy more freedom to deviate at high-confidence states, leading to better performance and stability. We thus term the resulting algorithm Soft Behavior-regularized Actor Critic (SBAC). Our experimental results show that SBAC matches or outperforms the state-of-the-art on a set of continuous control locomotion and manipulation tasks.
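As a minimal sketch of the objective described above (using standard discounted-MDP notation assumed here, not the paper's exact derivation), the performance difference lemma relates the learned policy $\pi$ to the behavior policy $\pi_\beta$ via
\begin{equation*}
J(\pi) - J(\pi_\beta)
= \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi}}\,\mathbb{E}_{a \sim \pi(\cdot\mid s)}\!\big[A^{\pi_\beta}(s,a)\big]
= \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi_\beta}}\!\Big[\tfrac{d^{\pi}(s)}{d^{\pi_\beta}(s)}\,\mathbb{E}_{a \sim \pi(\cdot\mid s)}\!\big[A^{\pi_\beta}(s,a)\big]\Big],
\end{equation*}
where $d^{\pi}$ denotes the discounted state-marginal distribution of $\pi$. The right-hand side is the behavior policy's advantage weighted by a state-marginal density ratio, matching the form of the objective summarized in the abstract.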