Function approximation is widely used in reinforcement learning to handle the computational difficulties associated with very large state spaces. However, function approximation introduces errors that may lead to instabilities when approximate dynamic programming (DP) techniques are used to obtain the optimal policy. Therefore, techniques such as lookahead for policy improvement and m-step rollout for policy evaluation are used in practice to improve the performance of approximate DP with function approximation. We quantitatively characterize, for the first time, the impact of lookahead and m-step rollout on the performance of approximate DP with function approximation: (i) without a sufficient combination of lookahead and m-step rollout, approximate DP may not converge; (ii) both lookahead and m-step rollout improve the convergence rate of approximate DP; and (iii) lookahead helps mitigate the effect of function approximation and the discount factor on the asymptotic performance of the algorithm. Our results are presented for two approximate DP methods: one that uses least-squares regression to perform function approximation and another that performs several steps of gradient descent on the least-squares objective in each iteration.
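The following is a minimal sketch of the iteration scheme described above: in each iteration, an H-step lookahead is used for policy improvement, an m-step rollout under the lookahead policy produces evaluation targets, and least-squares regression onto linear features performs the function approximation. The small random MDP, the feature matrix, and all parameter names (nS, nA, H, m, etc.) are illustrative assumptions and not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 20, 4, 0.9          # small random MDP (illustrative)
H, m = 3, 5                          # lookahead depth and rollout length
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state distribution
R = rng.random((nS, nA))                         # rewards R[s, a]
Phi = rng.standard_normal((nS, 4))               # linear features, d = 4

def bellman_optimal(V):
    """One application of the Bellman optimality operator T."""
    Q = R + gamma * P @ V            # Q[s, a]
    return Q.max(axis=1), Q.argmax(axis=1)

def bellman_policy(V, pi):
    """One application of T_pi for a deterministic policy pi."""
    return R[np.arange(nS), pi] + gamma * P[np.arange(nS), pi] @ V

theta = np.zeros(Phi.shape[1])
for k in range(50):
    V_hat = Phi @ theta
    # H-step lookahead (policy improvement): act greedily w.r.t. T^{H-1} V_hat.
    V_look = V_hat
    for _ in range(H - 1):
        V_look, _ = bellman_optimal(V_look)
    _, pi = bellman_optimal(V_look)
    # m-step rollout under pi (policy evaluation): target = T_pi^m V_hat.
    target = V_hat
    for _ in range(m):
        target = bellman_policy(target, pi)
    # Least-squares regression of the rollout targets onto the features.
    theta, *_ = np.linalg.lstsq(Phi, target, rcond=None)
    # Gradient-descent variant (the second method mentioned above): replace the
    # exact solve with a few gradient steps on ||Phi @ theta - target||^2, e.g.
    #   for _ in range(n_grad_steps):
    #       theta -= lr * Phi.T @ (Phi @ theta - target)

print("greedy lookahead policy:", pi)
```

In this sketch the lookahead and rollout operators are computed exactly from the model, which is only feasible because the state space is tiny; the point is simply to make the roles of H, m, and the least-squares fit concrete.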