Pessimism is of great importance in offline reinforcement learning (RL). One broad category of offline RL algorithms achieves pessimism through explicit or implicit behavior regularization. However, most of them consider only policy divergence as behavior regularization, ignoring how the offline state distribution differs from that of the learning policy, which may lead to under-pessimism for some states and over-pessimism for others. To address this problem, we propose a principled algorithmic framework for offline RL, called \emph{State-Aware Proximal Pessimism} (SA-PP). The key idea of SA-PP is to leverage the discounted stationary state distribution ratio between the learning policy and the offline dataset to modulate the degree of behavior regularization in a state-wise manner, so that pessimism is applied where it is actually warranted. We first provide theoretical justification for the superiority of SA-PP over previous algorithms, demonstrating that SA-PP yields a lower suboptimality upper bound in a broad range of settings. Furthermore, we propose a new algorithm named \emph{State-Aware Conservative Q-Learning} (SA-CQL), which instantiates SA-PP on top of the representative CQL algorithm, using DualDICE to estimate the discounted stationary state distribution ratios. Extensive experiments on standard offline RL benchmarks show that SA-CQL outperforms popular baselines on a large portion of the tasks and attains the highest average return.
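To make the key idea concrete, the following is a minimal sketch of how a state-wise weight could modulate a CQL-style conservative penalty; the exact objective used by SA-CQL is specified in the main text, and the weight $w(s)$, coefficient $\alpha$, and behavior policy $\pi_\beta$ below are illustrative assumptions drawn from the description above:
\begin{equation*}
\min_{Q}\; \alpha\, \mathbb{E}_{s\sim d^{\mathcal{D}}}\!\left[ w(s)\left( \log\!\sum_{a}\exp Q(s,a) \;-\; \mathbb{E}_{a\sim\pi_\beta(\cdot\mid s)}\!\left[Q(s,a)\right] \right)\right] \;+\; \mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[\left(Q(s,a) - \mathcal{B}^{\pi}\hat{Q}(s,a)\right)^{2}\right],
\end{equation*}
where $w(s)\approx d^{\pi}(s)/d^{\mathcal{D}}(s)$ denotes the discounted stationary state distribution ratio estimated with DualDICE, so that states over-represented under the learning policy relative to the dataset receive a stronger conservative penalty.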