The continuous nature of belief states in POMDPs presents significant computational challenges in learning the optimal policy. In this paper, we consider an approach that solves a Partially Observable Reinforcement Learning (PORL) problem by approximating the corresponding POMDP with a finite-state Markov Decision Process (MDP), which we call the Superstate MDP. We first derive theoretical guarantees, improving upon prior work, that relate the optimal value function of the transformed Superstate MDP to the optimal value function of the original POMDP. Next, we propose a policy-based learning approach with linear function approximation to learn the optimal policy for the Superstate MDP. As a consequence, our approach shows that a POMDP can be approximately solved by treating it as an MDP whose states correspond to finite histories, using TD learning followed by policy optimization. We show that the approximation error decreases exponentially with the length of this history. To the best of our knowledge, our finite-time bounds are the first to explicitly quantify the error introduced when applying standard TD learning to a setting where the true dynamics are not Markovian.
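To make the superstate construction concrete, the following is a minimal sketch, not the paper's actual algorithm or experiments. It assumes a hypothetical randomly generated toy POMDP, a fixed behavior policy, a length-m history of encoded (action, observation) pairs as the superstate, and one-hot features (a special case of linear function approximation); it runs only the TD-learning (policy evaluation) stage and omits the subsequent policy-optimization stage.

```python
import numpy as np
from collections import deque
from itertools import product

rng = np.random.default_rng(0)

# --- Hypothetical toy POMDP: 2 hidden states, 2 actions, 2 observations ---
n_s, n_a, n_o = 2, 2, 2
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a] = next-state distribution
Z = rng.dirichlet(np.ones(n_o), size=(n_s, n_a))   # Z[s', a] = observation distribution
R = rng.uniform(0, 1, size=(n_s, n_a))             # expected reward r(s, a)
gamma, m = 0.95, 3                                 # discount factor, history length

# Fixed behavior policy over the most recent observation (policy evaluation only).
pi = np.full((n_o, n_a), 1.0 / n_a)

# --- Superstates: all length-m histories of encoded (action, observation) pairs ---
histories = list(product(range(n_a * n_o), repeat=m))
index = {h: i for i, h in enumerate(histories)}

def phi(hist):
    """One-hot feature vector of the superstate (a special case of linear FA)."""
    x = np.zeros(len(histories))
    x[index[tuple(hist)]] = 1.0
    return x

# --- TD(0) with linear function approximation on the Superstate MDP ---
w = np.zeros(len(histories))
alpha = 0.1

s = rng.integers(n_s)                   # hidden state, never observed by the agent
hist = deque([0] * m, maxlen=m)         # arbitrarily padded initial history
for t in range(50_000):
    o_last = hist[-1] % n_o             # decode the most recent observation
    a = rng.choice(n_a, p=pi[o_last])
    r = R[s, a]
    s_next = rng.choice(n_s, p=P[s, a])
    o = rng.choice(n_o, p=Z[s_next, a])
    x = phi(hist)
    hist.append(a * n_o + o)            # slide the window: append encoded (a, o)
    x_next = phi(hist)
    # Standard TD(0) update, applied even though the superstate process is only
    # approximately Markovian; this mismatch is the error the bounds quantify.
    w += alpha * (r + gamma * x_next @ w - x @ w) * x
    s = s_next

print("Estimated values of the first few superstates:", np.round(w[:8], 3))
```

In this sketch the number of superstates grows as (n_a * n_o)^m, which is the sense in which the finite-history approximation trades a larger (but finite) state space for an approximation error that, per the result above, decays exponentially in m.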