In this paper, we develop a novel variant of the off-policy natural actor-critic algorithm with linear function approximation and establish a sample complexity of $\mathcal{O}(\epsilon^{-3})$, outperforming all previously known convergence bounds for such algorithms. To overcome the divergence caused by the deadly triad in off-policy policy evaluation under function approximation, we develop a critic that employs the $n$-step TD-learning algorithm with a properly chosen $n$. We present finite-sample convergence bounds for this critic under both constant and diminishing step sizes, which are of independent interest. Furthermore, we develop a variant of natural policy gradient under function approximation with an improved convergence rate of $\mathcal{O}(1/T)$ after $T$ iterations. Combining the finite-sample error bounds of the actor and the critic, we obtain the $\mathcal{O}(\epsilon^{-3})$ sample complexity. We derive our sample complexity bounds solely under the assumption that the behavior policy sufficiently explores all states and actions, which is a much weaker assumption than those made in the related literature.
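For reference, a standard $n$-step TD update with linear function approximation takes the form
\[
G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n \phi(S_{t+n})^\top \theta_t,
\qquad
\theta_{t+1} = \theta_t + \alpha_t \left( G_t^{(n)} - \phi(S_t)^\top \theta_t \right) \phi(S_t),
\]
where $\phi(\cdot)$ denotes the feature map, $\theta_t$ the critic parameters, $\gamma$ the discount factor, and $\alpha_t$ the step size. The critic developed in this paper is an off-policy variant built on this template, with $n$ chosen appropriately to control the error of the evaluation step.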