We focus on parameterized policy search for reinforcement learning over continuous action spaces. Typically, one assumes the score function associated with a policy is bounded, which fails to hold even for Gaussian policies. To properly address this issue, one must introduce an exploration tolerance parameter to quantify the region in which it is bounded. Doing so incurs a persistent bias that appears in the attenuation rate of the expected policy gradient norm, which is inversely proportional to the radius of the action space. To mitigate this hidden bias, heavy-tailed policy parameterizations may be used, which exhibit a bounded score function, but doing so can cause instability in algorithmic updates. To address these issues, in this work we study the convergence of policy gradient algorithms under heavy-tailed parameterizations, which we propose to stabilize with a combination of mirror ascent-type updates and gradient tracking. Our main theoretical contribution is the establishment that this scheme converges with constant step and batch sizes, whereas prior works require these parameters to respectively shrink to zero or grow to infinity. Experimentally, this scheme under a heavy-tailed policy parameterization yields improved reward accumulation across a variety of settings compared with standard benchmarks.
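As a brief illustration of the boundedness issue (not part of the original abstract; a scalar action and a mean-parameterized policy with fixed scale $\sigma$ are assumed for simplicity), the score of a Gaussian policy grows linearly in the action, whereas a heavy-tailed parameterization such as a Cauchy policy keeps it bounded:
\[
\nabla_\theta \log \pi^{\mathcal{N}}_\theta(a \mid s)
  = \frac{a-\mu_\theta(s)}{\sigma^{2}}\,\nabla_\theta \mu_\theta(s),
\qquad
\nabla_\theta \log \pi^{\mathcal{C}}_\theta(a \mid s)
  = \frac{2\,(a-\mu_\theta(s))}{\sigma^{2}+(a-\mu_\theta(s))^{2}}\,\nabla_\theta \mu_\theta(s).
\]
The Gaussian score is unbounded in $a$, while the Cauchy score has magnitude at most $\tfrac{1}{\sigma}\,\|\nabla_\theta \mu_\theta(s)\|$ since $\sigma^{2}+x^{2}\ge 2\sigma|x|$.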