Policy optimization is among the most popular and successful reinforcement learning algorithms, and there is increasing interest in understanding its theoretical guarantees. In this work, we initiate the study of policy optimization for the stochastic shortest path (SSP) problem, a goal-oriented reinforcement learning model that strictly generalizes the finite-horizon model and better captures many applications. We consider a wide range of settings, including stochastic and adversarial environments under full information or bandit feedback, and propose a policy optimization algorithm for each setting that makes use of novel correction terms and/or variants of dilated bonuses (Luo et al., 2021). For most settings, our algorithm is shown to achieve a near-optimal regret bound. One key technical contribution of this work is a new approximation scheme to tackle SSP problems that we call \textit{stacked discounted approximation} and use in all our proposed algorithms. Unlike the finite-horizon approximation that is heavily used in recent SSP algorithms, our new approximation enables us to learn a near-stationary policy with only logarithmic changes during an episode and could lead to an exponential improvement in space complexity.