In temporal difference (TD) learning, off-policy sampling is known to be more practical than on-policy sampling, and by decoupling learning from data collection, it enables data reuse. It is known that policy evaluation based on multi-step off-policy importance sampling has the interpretation of solving a generalized Bellman equation. In this paper, we derive finite-sample bounds for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed point of the corresponding generalized Bellman operator. Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted $\ell_p$-norm for every $p \in [1,\infty)$, with a common contraction factor. Off-policy TD learning is known to suffer from high variance due to the product of importance sampling ratios, and a number of algorithms (e.g., $Q^\pi(\lambda)$, Tree-Backup$(\lambda)$, Retrace$(\lambda)$, and $Q$-trace) have been proposed in the literature to address this issue. Our results immediately imply finite-sample bounds for these algorithms. In particular, we provide the first known finite-sample guarantees for $Q^\pi(\lambda)$, Tree-Backup$(\lambda)$, and Retrace$(\lambda)$, and improve the best-known bounds for $Q$-trace from [19]. Moreover, we characterize the bias-variance trade-off in each of these algorithms.
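For concreteness, the algorithms named above can all be viewed as instantiating a generic multi-step off-policy operator; the sketch below uses the standard formulation from the off-policy literature, with behavior policy $\mu$, target policy $\pi$, and trace coefficients $c_i$, and is given as an illustration rather than as the paper's exact parameterization:
\[
(\mathcal{T}Q)(s,a) = Q(s,a) + \mathbb{E}_{\mu}\!\left[\,\sum_{t \geq 0} \gamma^{t} \Big(\prod_{i=1}^{t} c_i\Big)\Big(r_t + \gamma\, \mathbb{E}_{A \sim \pi(\cdot \mid s_{t+1})}\big[Q(s_{t+1}, A)\big] - Q(s_t, a_t)\Big) \,\middle|\, s_0 = s,\, a_0 = a\right].
\]
Writing $\rho_i = \pi(a_i \mid s_i)/\mu(a_i \mid s_i)$ for the importance sampling ratio, the standard choices are $c_i = \lambda$ for $Q^\pi(\lambda)$, $c_i = \lambda\, \pi(a_i \mid s_i)$ for Tree-Backup$(\lambda)$, and $c_i = \lambda \min(1, \rho_i)$ for Retrace$(\lambda)$, while $Q$-trace applies separate truncation levels to the trace product and to the ratio entering the temporal difference. Truncating the trace coefficients keeps the product $\prod_{i} c_i$ bounded (controlling variance) but biases the fixed point away from $Q^\pi$, which is the bias-variance trade-off referred to above. The contraction property can then be read as the existence of weighted norms $\|\cdot\|_{w,p}$ and a factor $\beta \in (0,1)$ such that $\|\mathcal{T}Q_1 - \mathcal{T}Q_2\|_{w,p} \leq \beta\, \|Q_1 - Q_2\|_{w,p}$ holds for every $p \in [1,\infty)$ with the same $\beta$.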