北交大博士：强化学习与策略评估

北交大博士：强化学习与策略评估 | GAIR大讲堂

2017 年 11 月 6 日 AI研习社 不灵叔

分享内容

▼

强化学习(Reinforcement learning)在近几年收到越来越多的关注，对于强化学习的理论探讨也一直是研究热点。这次分享，我们将一起探讨强化学习的理论框架。在此基础上，策略评估(policy evaluation)是强化学习中最基础也是最重要的一个组成部分，其收敛性质的分析对于理解和改进这一类算法非常重要。但是如果只停留在一些非常理想化的假设下，得到的结果往往难以令人信服。在这次要分享的一个工作中，我们将给出一类策略评估算法在一些更贴近实际的假定下（RL天然的数据不独立同分布性，步长多种设置方式等）的收敛速率分析结果，从而更加确切的回答了关于这一类算法收敛性质的疑问，并且提供了解决类似问题的一个可用的理论工具。

建议预读文献

《Finite-sample analysis of proximal gradient td algorithms》

论文地址：http://chercheurs.lille.inria.fr/~ghavamza/my_website/Publications_files/uai15.pdf

In this paper, we show for the first time how gradient TD (GTD) reinforcement learning methods can be formally derived as true stochastic gradient algorithms, not with respect to their original objective functions as previously attempted, but rather using derived primal-dual saddle-point objective functions. We then conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence, and no finite-sample analysis had been attempted. Two novel GTD algorithms are also proposed, namely projected GTD2 and GTD2-MP, which use proximal “mirror maps” to yield improved convergence guarantees and acceleration, respectively. The results of our theoretical analysis imply that the GTD family of algorithms are comparable and may indeed be preferred over existing least squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.

《A convergent o(n) temporal-difference algorithm for off-policy learning with linear function approximation》

论文地址：http://papers.nips.cc/paper/3626-a-convergent-on-temporal-difference-algorithm-for-off-policy-learning-with-linear-function-approximation

We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, target policy, and exciting behavior policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d.\ policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L_2 norm. Our analysis proves that its expected update is in the direction of the gradient, assuring convergence under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without its quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.

《Fast gradient-descent methods for temporal-difference learning with linear function approximation》

论文地址：https://sites.ualberta.ca/~szepesva/papers/GTD-ICML09.pdf

Sutton, Szepesvari and Maei (2009) recently in- ´ troduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD. This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.