An algorithm is proposed for policy evaluation in Markov Decision Processes that achieves good empirical convergence rates. The algorithm tracks the Projected Bellman Error and is implemented as a true gradient-based algorithm; in this respect it differs from the TD($\lambda$) class of algorithms. Because it tracks the Projected Bellman Error rather than the unprojected Bellman Error, it also differs from the class of residual algorithms. Furthermore, its convergence is empirically much faster than that of the GTD2 class of algorithms, which likewise aim at tracking the Projected Bellman Error. We implemented the proposed algorithm in the DQN and DDPG frameworks and found that it achieves comparable results in both of these experiments.
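For reference (this is the standard textbook definition, included as background rather than the paper's exact formulation), the Projected Bellman Error objective that GTD2-style algorithms minimize under linear function approximation $V_\theta = \Phi\theta$ is
$$\mathrm{MSPBE}(\theta) \;=\; \big\lVert V_\theta - \Pi T^{\pi} V_\theta \big\rVert_D^{2}, \qquad \Pi \;=\; \Phi\,(\Phi^{\top} D\, \Phi)^{-1} \Phi^{\top} D,$$
where $T^{\pi}$ is the Bellman operator of the evaluated policy $\pi$, $\Pi$ is the projection onto the span of the features $\Phi$, and $D$ is the diagonal matrix of the stationary state distribution. Residual algorithms, by contrast, minimize $\lVert V_\theta - T^{\pi} V_\theta \rVert_D^{2}$ without the projection.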