光观和近似政策评价在与线性价值职能相近的政策迭接中的作用 (The Role of Lookahead and Approximate Policy Evaluation in Policy Iteration with Linear Value Function Approximation)

When the sizes of the state and action spaces are large, solving MDPs can be computationally prohibitive even if the probability transition matrix is known. So in practice, a number of techniques are used to approximately solve the dynamic programming problem, including lookahead, approximate policy evaluation using an m-step return, and function approximation. In a recent paper, (Efroni et al. 2019) studied the impact of lookahead on the convergence rate of approximate dynamic programming. In this paper, we show that these convergence results change dramatically when function approximation is used in conjunction with lookout and approximate policy evaluation using an m-step return. Specifically, we show that when linear function approximation is used to represent the value function, a certain minimum amount of lookahead and multi-step return is needed for the algorithm to even converge. And when this condition is met, we characterize the finite-time performance of policies obtained using such approximate policy iteration. Our results are presented for two different procedures to compute the function approximation: linear least-squares regression and gradient descent.

翻译：当状态和动作空间的大小很大时,即使知道概率转换矩阵,解决 MDP 的计算可能令人望而却步。因此,在实践中,使用一些技术来大致解决动态编程问题,包括外观、使用 m 步返回的近似政策评价以及函数近似。在最近的一篇论文中,(Efroni 等人, 2019)研究了外观对近似动态编程的趋同率的影响。在本文中,我们显示当函数近似与图示和使用 m 步返回的近似政策评价同时使用时,这些趋同率会发生巨大变化。具体地说,我们表明当使用线性函数近似来代表值函数时,需要一定数量的外观和多步返回才能使算法趋同。在满足这一条件时,我们用这种近似政策 Iteration 来描述获得的政策的有限时间性表现。我们的结果用于两种不同的计算函数近似值的程序:线性最小值回归和梯度下降。