In this paper, we provide finite-sample convergence guarantees for an off-policy variant of the natural actor-critic (NAC) algorithm based on importance sampling. In particular, we show that the algorithm converges to a globally optimal policy with a sample complexity of $\mathcal{O}(\epsilon^{-3}\log^2(1/\epsilon))$ under an appropriate choice of stepsizes. To overcome the large variance induced by importance sampling, we propose the $Q$-trace algorithm for the critic, inspired by the V-trace algorithm \cite{espeholt2018impala}. This enables us to explicitly control the bias and the variance, and to characterize the trade-off between them. As an advantage of off-policy sampling, a major feature of our result is that we require no assumptions beyond the ergodicity of the Markov chain induced by the behavior policy.
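To make the truncation mechanism concrete, recall the $n$-step V-trace target of \cite{espeholt2018impala}, on which the $Q$-trace critic is modeled (the notation below is illustrative and not the paper's exact operator): for a trajectory $(x_t, a_t, r_t)$ generated by a behavior policy $\mu$ and a target policy $\pi$,
$$
v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left(\prod_{i=s}^{t-1} c_i\right) \rho_t \big(r_t + \gamma V(x_{t+1}) - V(x_t)\big),
\qquad
\rho_t = \min\!\left(\bar{\rho},\, \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\right),
\quad
c_i = \min\!\left(\bar{c},\, \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\right),
$$
where truncating the importance ratios at the levels $\bar{\rho}$ and $\bar{c}$ bounds the variance of the update, at the cost of biasing the fixed point away from $V^{\pi}$; this is the bias-variance trade-off referred to above.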