We study Reinforcement Learning for partially observable dynamical systems using function approximation. We propose a new \textit{Partially Observable Bilinear Actor-Critic framework} that is general enough to include models such as observable tabular Partially Observable Markov Decision Processes (POMDPs), observable Linear-Quadratic-Gaussian (LQG) systems, Predictive State Representations (PSRs), a newly introduced model, Hilbert Space Embeddings of POMDPs, and observable POMDPs with latent low-rank transition. Under this framework, we propose an actor-critic style algorithm that is capable of performing agnostic policy learning. Given a policy class consisting of memory-based policies (which look at a fixed-length window of recent observations), and a value function class consisting of functions that take both memory and future observations as inputs, our algorithm learns to compete against the best memory-based policy in the given policy class. For certain examples, such as undercomplete observable tabular POMDPs, observable LQGs, and observable POMDPs with latent low-rank transition, by implicitly leveraging their special properties, our algorithm is even capable of competing against the globally optimal policy without paying an exponential dependence on the horizon in its sample complexity.
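To make the two function classes mentioned above concrete, the following is a minimal, illustrative Python sketch (not the paper's implementation) of a memory-based policy that acts on a fixed-length window of recent observations and a value function that scores a pair of memory and future observations. All class names, parameters, and the placeholder scoring rules are hypothetical and stand in for whatever learned models one would actually use.

\begin{verbatim}
# Hypothetical sketch of the interfaces of a memory-based policy
# and a critic over (memory, future observations); not the authors' code.
from collections import deque

import numpy as np


class MemoryBasedPolicy:
    """Policy pi(a | last M observations); the window length M is fixed."""

    def __init__(self, window_length: int, num_actions: int, rng=None):
        self.window = deque(maxlen=window_length)
        self.num_actions = num_actions
        self.rng = rng or np.random.default_rng()

    def observe(self, observation: np.ndarray) -> None:
        # Append the newest observation; the deque drops the oldest one.
        self.window.append(observation)

    def act(self) -> int:
        # Placeholder scoring rule: any map from the memory window to
        # action probabilities (e.g. a small network) fits this interface.
        memory = (np.concatenate(list(self.window))
                  if self.window else np.zeros(1))
        logits = np.ones(self.num_actions) * memory.mean()
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(self.rng.choice(self.num_actions, p=probs))


def critic_value(memory: np.ndarray, future_observations: np.ndarray) -> float:
    """Value-function class member: a function of memory AND future observations."""
    # Placeholder score; a learned critic would replace this.
    return float(memory.mean() * future_observations.mean())
\end{verbatim}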