We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reduction-based algorithms exhibit local convergence in the function space, rather than in the parameter space as policy gradient algorithms do, and are therefore unaffected by possibly non-linear or discontinuous parameterizations of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing an $\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm attains $\varepsilon$-global optimality after $O(\varepsilon^{-2})$ samples, improving upon the previously established $O(\varepsilon^{-3})$ sample requirement.
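For context, a minimal sketch of the conservative update underlying Conservative Policy Iteration, written generically (the mixing coefficient $\alpha_t$ and the empirical objective are placeholders; the variance-reduced advantage estimator proposed here is not shown):
\[
\pi_{t+1} = (1-\alpha_t)\,\pi_t + \alpha_t\,\pi'_t,
\qquad
\pi'_t \in \operatorname*{argmax}_{\pi \in \Pi}\ \widehat{\mathbb{E}}_{s \sim d^{\pi_t}}\!\left[\mathbb{E}_{a \sim \pi(\cdot \mid s)}\,\widehat{A}^{\pi_t}(s,a)\right],
\]
where $\widehat{A}^{\pi_t}$ is an empirical advantage estimate and the inner maximization is the empirical risk minimization step over the policy class $\Pi$.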