Reinforcement learning (RL) is empirically successful in complex nonlinear Markov decision processes (MDPs) with continuous state spaces. By contrast, the majority of theoretical RL literature requires the MDP to satisfy some form of linear structure in order to guarantee sample-efficient RL. Such efforts typically assume that the transition dynamics or the value function of the MDP are described by linear functions of the state features. To resolve this discrepancy between theory and practice, we introduce the Effective Planning Window (EPW) condition, a structural condition on MDPs that makes no linearity assumptions. We demonstrate that the EPW condition permits sample-efficient RL by providing an algorithm that provably solves MDPs satisfying this condition. Our algorithm requires minimal assumptions on the policy class, which can include multi-layer neural networks with nonlinear activation functions. Notably, the EPW condition is directly motivated by popular gaming benchmarks, and we show that many classic Atari games satisfy this condition. We additionally show the necessity of conditions like EPW by demonstrating that simple MDPs with slight nonlinearities cannot be solved sample efficiently.