In this paper, we provide finite-sample convergence guarantees for an off-policy variant of the natural actor-critic (NAC) algorithm based on Importance Sampling. In particular, we show that the algorithm converges to a globally optimal policy with a sample complexity of $\mathcal{O}(\epsilon^{-3}\log^2(1/\epsilon))$ under an appropriate choice of stepsizes. To overcome the large variance induced by Importance Sampling, we propose the $Q$-trace algorithm for the critic, which is inspired by the V-trace algorithm (Espeholt et al., 2018). This enables us to explicitly control the bias and the variance, and to characterize the trade-off between them. As an advantage of off-policy sampling, a major feature of our result is that we require no assumptions beyond the ergodicity of the Markov chain induced by the behavior policy.
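For context, the V-trace target of Espeholt et al. (2018), which inspires the $Q$-trace critic, truncates the importance sampling ratios. A sketch of that target, in the notation of the V-trace paper (which need not match the notation used later in this work), is
$$
v_s \;=\; V(x_s) \;+\; \sum_{t=s}^{s+n-1} \gamma^{t-s} \Big( \prod_{i=s}^{t-1} c_i \Big)\, \rho_t \big( r_t + \gamma V(x_{t+1}) - V(x_t) \big),
\qquad
\rho_t = \min\!\Big(\bar{\rho}, \tfrac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\Big),
\quad
c_i = \min\!\Big(\bar{c}, \tfrac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\Big),
$$
where $\pi$ is the target policy, $\mu$ is the behavior policy, and $\bar{\rho} \geq \bar{c}$ are truncation levels. Capping the importance sampling ratios in this way is what makes the bias (governed by $\bar{\rho}$) and the variance (governed by $\bar{c}$) explicitly controllable; the $Q$-trace critic applies the analogous truncation at the level of $Q$-function targets.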