Actor-critic methods are widely used in offline reinforcement learning practice, but they are not so well understood theoretically. We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle, leading to several key advantages compared to the state of the art. The algorithm can operate when the Bellman evaluation operator is closed with respect to the action value function of the actor's policies; this is a more general setting than the low-rank MDP model. Despite the added generality, the procedure is computationally tractable, as it involves the solution of a sequence of second-order programs. We prove an upper bound on the suboptimality gap of the policy returned by the procedure that depends on the data coverage of any arbitrary, possibly data-dependent comparator policy. The achievable guarantee is complemented with a minimax lower bound that matches it up to logarithmic factors.
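For reference, the Bellman-closedness condition mentioned above is typically stated as follows; this is a generic formulation, and the symbols $\Pi$, $\mathcal{F}$, $r$, $\gamma$, and $P$ are illustrative notation rather than the paper's own. Given a policy class $\Pi$ and a value-function class $\mathcal{F}$, the Bellman evaluation operator of a policy $\pi \in \Pi$ is
$$(\mathcal{T}^{\pi} f)(s,a) \;=\; r(s,a) \;+\; \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a),\, a' \sim \pi(\cdot \mid s')}\big[ f(s',a') \big],$$
and closedness requires $\mathcal{T}^{\pi} f \in \mathcal{F}$ for every $\pi \in \Pi$ and $f \in \mathcal{F}$. Low-rank (linear) MDPs with a linear class $\mathcal{F}$ are a special case in which this closure holds automatically, which is why the condition is a strictly more general setting.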