We consider the reinforcement learning problem for partially observed Markov decision processes (POMDPs) with large or even countably infinite state spaces, where the controller has access to only noisy observations of the underlying controlled Markov chain. We consider a natural actor-critic method that employs a finite internal memory for policy parameterization, and a multi-step temporal difference learning algorithm for policy evaluation. We establish, to the best of our knowledge, the first non-asymptotic global convergence of actor-critic methods for partially observed systems under function approximation. In particular, in addition to the function approximation and statistical errors that also arise in MDPs, we explicitly characterize the error due to the use of finite-state controllers. This additional error is stated in terms of the total variation distance between the traditional belief state in POMDPs and the posterior distribution of the hidden state when using a finite-state controller. Further, we show that this error can be made small in the case of sliding-block controllers by using larger block sizes.
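To make the setup concrete, the following is a minimal sketch, not the paper's algorithm, of the ingredients named above: a sliding-block controller that conditions only on the last m observations (the finite internal memory), a linear multi-step TD critic over that memory state, and a softmax actor with a natural-gradient-style update. The toy POMDP, one-hot feature map, step sizes, and rollout-based multi-step target are illustrative assumptions introduced here for the sketch.

```python
import numpy as np

# Hedged sketch of a natural actor-critic loop with a sliding-block controller.
# Everything below (toy POMDP, features, step sizes) is an assumption for
# illustration, not the paper's exact construction.

rng = np.random.default_rng(0)

# Toy POMDP: hidden states S, observations O, actions A (all hypothetical).
S, O, A = 4, 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a]: distribution over next states
Z = rng.dirichlet(np.ones(O), size=S)        # Z[s]:    distribution over observations
R = rng.uniform(0, 1, size=(S, A))           # reward r(s, a)
gamma = 0.95

m = 2                                        # sliding-block (memory) length
n_windows = O ** m                           # number of finite internal-memory states

def window_index(window):
    """Encode the last m observations as a single integer memory state."""
    idx = 0
    for o in window:
        idx = idx * O + o
    return idx

def features(widx):
    """One-hot features of the memory state (linear critic)."""
    phi = np.zeros(n_windows)
    phi[widx] = 1.0
    return phi

theta = np.zeros((n_windows, A))             # softmax policy parameters (actor)
w = np.zeros(n_windows)                      # linear value parameters (critic)

def policy(widx):
    logits = theta[widx]
    p = np.exp(logits - logits.max())
    return p / p.sum()

alpha, beta = 0.05, 0.1                      # actor / critic step sizes (assumed)
n_step = 5                                   # multi-step TD horizon (assumed)

s = rng.integers(S)
window = [rng.integers(O) for _ in range(m)]

for t in range(20000):
    widx = window_index(window)
    a = rng.choice(A, p=policy(widx))

    # Simulate n_step on-policy transitions to form a multi-step TD target.
    G, disc = 0.0, 1.0
    s_k, win_k, a_k = s, list(window), a
    for k in range(n_step):
        G += disc * R[s_k, a_k]
        disc *= gamma
        s_k = rng.choice(S, p=P[s_k, a_k])
        win_k = win_k[1:] + [rng.choice(O, p=Z[s_k])]
        a_k = rng.choice(A, p=policy(window_index(win_k)))
    target = G + disc * w @ features(window_index(win_k))

    # Critic: multi-step TD update on the memory-state value estimate.
    delta = target - w @ features(widx)
    w += beta * delta * features(widx)

    # Actor: for tabular softmax parameterization, a natural-gradient step is
    # proportional to the advantage; the TD error serves as its estimate here.
    theta[widx, a] += alpha * delta

    # Environment step actually taken by the controller.
    s = rng.choice(S, p=P[s, a])
    window = window[1:] + [rng.choice(O, p=Z[s])]
```

Increasing the block size m enlarges the memory-state space O**m, which is the mechanism by which, per the abstract, the finite-state-controller error (measured in total variation between the belief state and the controller's posterior) can be made small.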