In this paper, we study the convergence properties of off-policy policy improvement algorithms with state-action density ratio correction in the function approximation setting, where the objective function is formulated as a max-max-min optimization problem. We characterize the bias of the learning objective and present two strategies with finite-time convergence guarantees. In the first strategy, we present an algorithm, P-SREDA, with convergence rate $O(\epsilon^{-3})$, whose dependence on $\epsilon$ is optimal. In the second strategy, we propose a new off-policy actor-critic style algorithm named O-SPIM. We prove that O-SPIM converges to a stationary point with total complexity $O(\epsilon^{-4})$, which matches the convergence rate of some recent actor-critic algorithms in the on-policy setting.