Designing off-policy reinforcement learning algorithms is typically a challenging task, because a desirable iteration update often involves an expectation over an on-policy distribution. Prior off-policy actor-critic (AC) algorithms have introduced an additional critic that uses the density ratio to correct the distribution mismatch and stabilize convergence, but at the cost of potentially introducing high bias due to estimation errors in both the density ratio and the value function. In this paper, we develop a doubly robust off-policy AC (DR-Off-PAC) algorithm for the discounted MDP setting, which exploits learned nuisance functions to reduce estimation errors. Moreover, DR-Off-PAC adopts a single-timescale structure, in which the actor and critics are updated simultaneously with constant step sizes, and is thus more sample efficient than prior algorithms that adopt either a two-timescale or a nested-loop structure. We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy. We also show that the overall convergence of DR-Off-PAC is doubly robust to approximation errors that depend only on the expressive power of the approximation function classes. To the best of our knowledge, our study establishes the first overall sample complexity analysis for a single-timescale off-policy AC algorithm.
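To make the single-timescale, doubly robust structure concrete, the following is a minimal sketch on a tabular MDP, assuming a softmax actor, a tabular Q critic, and a tabular density-ratio estimate that are all updated simultaneously with constant step sizes. The specific update rules, the heuristic density-ratio recursion, and all variable names here are illustrative assumptions, not the paper's exact DR-Off-PAC estimators.

```python
# Illustrative sketch (not the paper's exact recursions): single-timescale off-policy
# actor-critic on a small random tabular MDP. The policy-gradient estimate combines a
# critic-based "direct" term with an importance-weighted TD residual, so an error in
# either the critic or the density ratio can be partially compensated by the other
# (the doubly robust idea). All step sizes are constant and updates are simultaneous.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition kernel: P[s, a] is a dist over s'
R = rng.uniform(size=(nS, nA))                  # reward table
mu = np.full((nS, nA), 1.0 / nA)                # fixed behavior policy (uniform)

theta = np.zeros((nS, nA))                      # actor (softmax) parameters
Q = np.zeros((nS, nA))                          # value critic
w = np.ones(nS)                                 # density-ratio estimate d_pi / d_mu

alpha = beta = eta = 0.05                       # constant step sizes (single timescale)

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = rng.integers(nS)
for t in range(20000):
    a = rng.choice(nA, p=mu[s])
    s2 = rng.choice(nS, p=P[s, a])
    r = R[s, a]
    p = pi(s)
    rho = p[a] / mu[s, a]                       # per-action importance ratio
    v_next = pi(s2) @ Q[s2]
    delta = r + gamma * v_next - Q[s, a]        # TD residual under the target policy

    # Doubly robust policy-gradient estimate: critic-based direct term
    # plus an importance-corrected residual term, both reweighted by w(s).
    score = lambda b: np.eye(nA)[b] - p         # grad of log softmax w.r.t. theta[s]
    direct = sum(p[b] * Q[s, b] * score(b) for b in range(nA))
    g = w[s] * (direct + rho * delta * score(a))

    # Simultaneous constant-step-size updates of actor and both critics.
    theta[s] += alpha * g
    Q[s, a] += beta * w[s] * rho * delta
    # Crude heuristic stand-in for a density-ratio critic update.
    w[s2] += eta * ((1 - gamma) + gamma * w[s] * rho - w[s2])

    s = s2
```

The point of the sketch is the loop structure: no inner loop solves the critic or density-ratio subproblem to convergence before the actor moves, which is the single-timescale property the abstract contrasts with two-timescale and nested-loop algorithms.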