We revisit the standard formulation of the tabular actor-critic algorithm as a two-time-scale stochastic approximation, with the value function computed on the faster time scale and the policy on the slower one; this emulates policy iteration. We observe that reversing the time scales in fact emulates value iteration and yields a legitimate algorithm. We provide a proof of convergence and compare the two schemes empirically, with and without function approximation (using both linear and nonlinear approximators), and observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.
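To make the time-scale reversal concrete, the following is a minimal sketch of the coupled updates, assuming a discounted MDP, a TD(0)-style tabular critic, and a softmax policy-gradient actor with parameters $\theta$; the step-size sequences $a(n)$, $b(n)$ and the exact update forms are illustrative assumptions rather than the paper's precise scheme.

% Illustrative two-time-scale updates (assumed forms, not the exact scheme of the paper).
% Actor-critic: critic on the faster time scale, actor on the slower one, i.e. b(n)/a(n) -> 0.
\begin{align*}
  \delta_n &= r_n + \gamma V_n(s_{n+1}) - V_n(s_n), \\
  V_{n+1}(s_n) &= V_n(s_n) + a(n)\,\delta_n, \\
  \theta_{n+1} &= \theta_n + b(n)\,\delta_n\,\nabla_\theta \log \pi_{\theta_n}(a_n \mid s_n),
  \qquad \frac{b(n)}{a(n)} \to 0 .
\end{align*}
% Critic-actor: the same coupled iterates with the roles of a(n) and b(n) reversed
% (a(n)/b(n) -> 0), so the policy tracks the slowly varying value estimate and the
% overall scheme emulates value iteration rather than policy iteration.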