Policy optimization methods are popular reinforcement learning algorithms in practice. Recent works have built a theoretical foundation for them by proving $\sqrt{T}$ regret bounds even when the losses are adversarial. Such bounds are tight in the worst case but often overly pessimistic. In this work, we show that in tabular Markov decision processes (MDPs), by properly designing the regularizer, the exploration bonus, and the learning rates, one can achieve a more favorable polylog$(T)$ regret when the losses are stochastic, without sacrificing the worst-case guarantee in the adversarial regime. To our knowledge, this is also the first time a gap-dependent polylog$(T)$ regret bound is shown for policy optimization. Specifically, we achieve this by leveraging a Tsallis entropy or a Shannon entropy regularizer in the policy update. Then we show that, under known transitions, we can further obtain a first-order regret bound in the adversarial regime by leveraging the log-barrier regularizer.
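For concreteness, the regularized policy update referred to above can be written schematically as an FTRL/OMD-style step; the notation below ($\widehat{Q}_t$ for the estimated loss-to-go, $B_t$ for the exploration bonus, $\eta_t$ for the learning rate, $D_\psi$ for the Bregman divergence induced by the regularizer $\psi$) is an illustrative sketch and not the exact form used in the paper:
\[
\pi_{t+1}(\cdot\mid s) \;\in\; \operatorname*{arg\,min}_{\pi \in \Delta(A)}
\Big\{ \big\langle \pi,\, \widehat{Q}_t(s,\cdot) - B_t(s,\cdot) \big\rangle
\;+\; \tfrac{1}{\eta_t}\, D_{\psi}\!\big(\pi,\ \pi_t(\cdot\mid s)\big) \Big\},
\]
where $\psi$ is instantiated as the negative Tsallis entropy $-\sum_a \sqrt{\pi(a)}$ or the negative Shannon entropy $\sum_a \pi(a)\log\pi(a)$ for the stochastic-versus-adversarial results, and as the log-barrier $-\sum_a \log\pi(a)$ for the first-order bound under known transitions.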