We analyze the convergence rate of the unregularized natural policy gradient algorithm with log-linear policy parametrizations in infinite-horizon discounted Markov decision processes. In the deterministic case, when the Q-value is known and can be approximated by a linear combination of a known feature function up to a bias error, we show that a geometrically-increasing step size yields a linear convergence rate towards an optimal policy. We then consider the sample-based case, when the best representation of the Q-value function among linear combinations of a known feature function is known up to an estimation error. In this setting, we show that the algorithm enjoys the same linear guarantees as in the deterministic case up to an error term that depends on the estimation error, the bias error, and the condition number of the feature covariance matrix. Our results build upon the general framework of policy mirror descent and extend previous findings for the softmax tabular parametrization to the log-linear policy class.
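For concreteness, a minimal sketch of the setting described above is given below in standard Q-NPG notation; the symbols ($\phi$, $\theta^{(t)}$, $w^{(t)}$, $d^{(t)}$, $\eta_t$, discount factor $\gamma$) are illustrative conventions from the broader natural-policy-gradient literature rather than the paper's own notation, and the geometric step-size rule shown is one common instantiation.
\begin{align*}
  % Log-linear policy class over a known feature map \phi(s,a) \in \mathbb{R}^d
  \pi_{\theta}(a \mid s)
    &= \frac{\exp\!\big(\theta^{\top}\phi(s,a)\big)}
            {\sum_{a'} \exp\!\big(\theta^{\top}\phi(s,a')\big)},
    && \phi(s,a) \in \mathbb{R}^d,\\[2pt]
  % Best linear fit of the current Q-function (exact up to a bias error;
  % in the sample-based case it is only estimated, up to an estimation error)
  w^{(t)}
    &\in \arg\min_{w}\;
       \mathbb{E}_{(s,a)\sim d^{(t)}}
       \Big[\big(Q^{\pi^{(t)}}(s,a) - w^{\top}\phi(s,a)\big)^{2}\Big],
    && \\[2pt]
  % Natural policy gradient step with a geometrically increasing step size
  \theta^{(t+1)}
    &= \theta^{(t)} + \eta_t\, w^{(t)},
    && \eta_{t+1} = \eta_t / \gamma .
\end{align*}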