We focus on parameterized policy search for reinforcement learning over continuous action spaces. Typically, one assumes that the score function associated with the policy is bounded, an assumption that fails to hold even for Gaussian policies. To properly address this issue, one must introduce an exploration tolerance parameter to quantify the region over which the score function is bounded. Doing so incurs a persistent bias that appears in the attenuation rate of the expected policy gradient norm, which is inversely proportional to the radius of the action space. To mitigate this hidden bias, heavy-tailed policy parameterizations may be used, which exhibit a bounded score function, but doing so can cause instability in algorithmic updates. To address these issues, in this work we study the convergence of policy gradient algorithms under heavy-tailed parameterizations, which we propose to stabilize with a combination of mirror ascent-type updates and gradient tracking. Our main theoretical contribution is establishing that this scheme converges with constant step and batch sizes, whereas prior works require these parameters to shrink to zero or grow to infinity, respectively. Experimentally, this scheme under a heavy-tailed policy parameterization yields improved reward accumulation across a variety of settings as compared with standard benchmarks.
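To make the boundedness claim concrete, the following illustration (not part of the original abstract; the Gaussian and Cauchy forms are chosen here purely as examples of light- and heavy-tailed parameterizations) contrasts the score functions with respect to the location parameter \(\mu\):
\[
\nabla_{\mu}\log \mathcal{N}(a;\mu,\sigma^2) \;=\; \frac{a-\mu}{\sigma^2},
\qquad
\nabla_{\mu}\log \mathrm{Cauchy}(a;\mu,\gamma) \;=\; \frac{2(a-\mu)}{\gamma^2 + (a-\mu)^2}.
\]
The Gaussian score grows linearly in \(|a-\mu|\) and is therefore unbounded over an unbounded action space, whereas the Cauchy score satisfies \(|\nabla_{\mu}\log \mathrm{Cauchy}(a;\mu,\gamma)| \le 1/\gamma\) for all actions, illustrating the bounded-score property that heavy-tailed parameterizations provide.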