Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy. (i) At each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step features. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank), where $\epsilon$ is the optimality gap. To the best of our knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.
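For intuition, one standard way to formalize the low-rank structure referenced above (the notation here is illustrative and not taken verbatim from the paper) is to assume that each transition kernel factorizes through rank-$r$ feature maps:
\[
\mathbb{P}_h(s_{h+1} \mid s_h, a_h) \;=\; \big\langle \phi_h(s_h, a_h),\, \mu_h(s_{h+1}) \big\rangle,
\qquad \phi_h(s_h, a_h),\ \mu_h(s_{h+1}) \in \mathbb{R}^r,
\]
where $r$ is the intrinsic dimension (the rank) and $\phi_h$ plays the role of the per-step feature that factorizes the transition kernel, so the sample complexity can depend on $r$ rather than on the extrinsic dimension of the observation and state spaces.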