Entropy regularization is an effective technique for encouraging exploration and preventing premature convergence of (vanilla) policy gradient methods in reinforcement learning (RL). However, the theoretical understanding of entropy-regularized RL algorithms has been limited. In this paper, we revisit the classical entropy-regularized policy gradient methods with the softmax policy parametrization, whose convergence has so far only been established assuming access to exact gradient oracles. To go beyond this setting, we propose the first set of (nearly) unbiased stochastic policy gradient estimators with trajectory-level entropy regularization: one an unbiased visitation-measure-based estimator, and the other a nearly unbiased yet more practical trajectory-based estimator. We prove that although the estimators themselves are unbounded in general, due to the additional logarithmic policy rewards introduced by the entropy term, their variances are uniformly bounded. This enables the first convergence results for stochastic entropy-regularized policy gradient methods, to both stationary points and globally optimal policies. We also establish improved sample complexity results under a good initialization.
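For concreteness, a minimal sketch of the standard setup underlying the abstract, under the usual conventions (the symbols $\theta$, $\tau$, and $\gamma$ below are our notation for the tabular softmax parameters, the regularization weight, and the discount factor, and need not match the paper's):
\[
\pi_\theta(a \mid s) \;=\; \frac{\exp(\theta_{s,a})}{\sum_{a'} \exp(\theta_{s,a'})},
\qquad
J_\tau(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^t \Bigl( r(s_t, a_t) - \tau \log \pi_\theta(a_t \mid s_t) \Bigr)\right].
\]
Here the "logarithmic policy rewards" are the $-\tau \log \pi_\theta(a_t \mid s_t)$ terms, which are unbounded as $\pi_\theta(a_t \mid s_t) \to 0$; this is why the stochastic gradient estimators can be unbounded even though, as claimed, their variances remain uniformly bounded.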