Reinforcement learning is a framework for interactive decision-making in which incentives are revealed sequentially over time, without a model of the system dynamics. Owing to its scalability to continuous spaces, we focus on policy search, in which one iteratively improves a parameterized policy via stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), under persistent exploration and suitable parameterization, global optimality may be obtained. By contrast, in continuous space, non-convexity poses a pathological challenge, as evidenced by existing convergence results being mostly limited to stationarity or arbitrary local extrema. To close this gap, we take a step towards persistent exploration in continuous space through policy parameterizations defined by heavier-tailed distributions with tail-index parameter alpha, which increase the likelihood of jumping in state space. Doing so invalidates the smoothness conditions on the score function that are common in PG analyses. We therefore establish how the convergence rate to stationarity depends on the policy's tail index alpha, a Hölder continuity parameter, integrability conditions, and an exploration tolerance parameter introduced here for the first time. Further, we characterize the dependence of the set of local maxima on the tail index through an exit- and transition-time analysis of a suitably defined Markov chain, identifying that policies associated with Lévy processes of heavier tail converge to wider peaks. This phenomenon is known to yield improved stability to perturbations in supervised learning, and we corroborate that it also manifests as improved performance of policy search, especially when myopic and farsighted incentives are misaligned.
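As a rough illustration of the setup, the sketch below runs a REINFORCE-style stochastic policy gradient on a toy one-dimensional bandit with a heavy-tailed policy. Since alpha-stable (Lévy) densities lack a closed form, the sketch substitutes a Student's t location family whose degrees of freedom nu stand in for the tail index alpha; the reward landscape, hyperparameters, and the choice of surrogate are all illustrative assumptions, not the paper's construction.

```python
# Minimal sketch, not the paper's algorithm: REINFORCE-style stochastic policy
# gradient with a heavy-tailed policy on a toy 1-D continuous bandit.
# Assumption: a Student's t location family stands in for the tail-index-alpha
# (alpha-stable / Levy) policies of the abstract, since t densities have a
# closed-form score while alpha-stable densities do not.
import numpy as np

rng = np.random.default_rng(0)


def reward(a):
    # Hypothetical reward: a modest peak at 0 (myopic incentive) and a higher,
    # wider peak far away at 6 (farsighted incentive).
    return (0.3 * np.exp(-a**2 / (2 * 0.5**2))
            + 1.0 * np.exp(-(a - 6.0) ** 2 / (2 * 1.5**2)))


def score_mu(a, mu, sigma, nu):
    # d/dmu log p(a | mu, sigma, nu) for the Student's t location-scale family;
    # this is the score function whose usual smoothness conditions break down
    # as the tails grow heavier.
    return (nu + 1.0) * (a - mu) / (nu * sigma**2 + (a - mu) ** 2)


def policy_search(nu, steps=20000, lr=0.1, sigma=0.5, mu=0.0):
    baseline = 0.0
    for _ in range(steps):
        a = mu + sigma * rng.standard_t(df=nu)      # heavy-tailed action sample
        r = reward(a)
        baseline += 0.05 * (r - baseline)           # running baseline (variance reduction)
        mu += lr * (r - baseline) * score_mu(a, mu, sigma, nu)  # stochastic PG update
    return mu


# Smaller nu => heavier tails => occasional long jumps that sample the distant,
# wider peak and generate gradient signal toward it; with near-Gaussian tails
# (large nu) such samples are vanishingly rare and the policy tends to remain
# near the local peak at 0.
for nu in (1.5, 50.0):
    mu_final = policy_search(nu)
    print(f"nu={nu:5.1f}  final mean={mu_final:+6.2f}  reward={reward(mu_final):.3f}")
```

The running baseline and single scalar mean parameter keep the sketch minimal; the point is only that the tail parameter of the policy controls how often actions far from the current mean are proposed, which is the exploration mechanism the abstract analyzes.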