Many reinforcement learning (RL) environments in practice feature enormous state spaces that admit a compact "factored" description, modeled by Factored Markov Decision Processes (FMDPs). We present the first polynomial-time algorithm for RL in FMDPs that does not rely on an oracle planner; rather than requiring a linear transition model, it requires only a linear value function with a suitable local basis with respect to the factorization. Under this assumption, we can solve FMDPs in polynomial time by constructing an efficient separation oracle for convex optimization. Importantly, and in contrast to prior work, we do not assume that the transitions on the various factors are independent.
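To make the separation-oracle idea concrete, the following is a minimal, generic sketch of the ellipsoid method for a convex feasibility problem, where the separation oracle either certifies that a point is feasible or returns a violated (separating) linear constraint. This is not the paper's algorithm; the constraint set and all names here are hypothetical illustrations of the general technique.

```python
import numpy as np

def ellipsoid_feasibility(oracle, n, radius=10.0, max_iters=1000, tol=1e-12):
    """Find a point in a convex set via the ellipsoid method.

    oracle(x) returns None if x is feasible, otherwise a vector a
    such that the feasible set lies in the halfspace {z : a @ z <= a @ x}.
    """
    x = np.zeros(n)
    P = (radius ** 2) * np.eye(n)  # ellipsoid {z : (z-x)^T P^{-1} (z-x) <= 1}
    for _ in range(max_iters):
        a = oracle(x)
        if a is None:
            return x  # feasible point found
        g = P @ a
        denom = np.sqrt(a @ g)
        if denom < tol:
            return None  # ellipsoid has degenerated
        g = g / denom  # = P a / sqrt(a^T P a)
        # Standard ellipsoid update after a central cut (requires n >= 2).
        x = x - g / (n + 1)
        P = (n * n / (n * n - 1.0)) * (P - (2.0 / (n + 1)) * np.outer(g, g))
    return None

# Hypothetical feasible region: x1 >= 1, x2 >= 1, x1 + x2 <= 3,
# expressed as constraints a @ x <= b.
constraints = [
    (np.array([-1.0, 0.0]), -1.0),
    (np.array([0.0, -1.0]), -1.0),
    (np.array([1.0, 1.0]), 3.0),
]

def oracle(x):
    for a, b in constraints:
        if a @ x > b + 1e-9:
            return a  # violated constraint acts as the separating hyperplane
    return None

x_feasible = ellipsoid_feasibility(oracle, n=2)
print(x_feasible)
```

The key efficiency property mirrored here is that the oracle never enumerates the feasible set; it only answers separation queries, which is what allows the abstract's construction to avoid an explicit (exponentially large) state enumeration.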