Recursion is the fundamental paradigm to finitely describe potentially infinite objects. As state-of-the-art reinforcement learning (RL) algorithms cannot directly reason about recursion, they must rely on the practitioner's ingenuity in designing a suitable "flat" representation of the environment. The resulting manual feature constructions and approximations are cumbersome and error-prone; their lack of transparency hampers scalability. To overcome these challenges, we develop RL algorithms capable of computing optimal policies in environments described as a collection of Markov decision processes (MDPs) that can recursively invoke one another. Each constituent MDP is characterized by several entry and exit points that correspond to input and output values of these invocations. These recursive MDPs (or RMDPs) are expressively equivalent to probabilistic pushdown systems (with call-stack playing the role of the pushdown stack), and can model probabilistic programs with recursive procedural calls. We introduce Recursive Q-learning -- a model-free RL algorithm for RMDPs -- and prove that it converges for finite, single-exit and deterministic multi-exit RMDPs under mild assumptions.
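To make the structure of an RMDP concrete, the sketch below shows one possible encoding and simulator in Python: each component is a finite MDP with designated entry and exit nodes, "box" nodes invoke other components, and the simulator maintains an explicit call stack, mirroring the correspondence with probabilistic pushdown systems stated above. The names (Component, Box, run_episode, the policy interface) are illustrative assumptions for this sketch, not the paper's implementation; rewards are omitted for brevity.

```python
# Minimal RMDP sketch (illustrative, not the paper's code): components are finite
# MDPs with entry/exit nodes; box nodes recursively invoke other components; the
# call stack plays the role of the pushdown stack.

import random
from dataclasses import dataclass, field


@dataclass
class Box:
    component: str   # name of the component this box invokes
    entry: str       # entry node passed to the callee (the "input" of the call)
    returns: dict    # maps each exit node of the callee to a successor node here


@dataclass
class Component:
    entries: list                                     # entry nodes ("inputs")
    exits: list                                       # exit nodes ("outputs")
    transitions: dict = field(default_factory=dict)   # (node, action) -> [(prob, next_node), ...]
    boxes: dict = field(default_factory=dict)         # node -> Box


def run_episode(rmdp, root, entry, policy, max_steps=1000):
    """Simulate one episode of the RMDP given by {name: Component}."""
    stack = []                          # frames of pending invocations
    comp, node = root, entry
    for _ in range(max_steps):
        c = rmdp[comp]
        if node in c.boxes:             # recursive invocation: push a frame
            box = c.boxes[node]
            stack.append((comp, box))
            comp, node = box.component, box.entry
            continue
        if node in c.exits:             # return: pop a frame, map exit to successor
            if not stack:
                return node             # exit of the root invocation ends the episode
            comp, box = stack.pop()
            node = box.returns[node]
            continue
        action = policy(comp, node)     # policy: (component, node) -> action
        probs, succs = zip(*c.transitions[(node, action)])
        node = random.choices(succs, weights=probs, k=1)[0]
    return None                         # episode truncated
```

A policy here is any function from a (component, node) pair to an action; a model-free learner such as the paper's Recursive Q-learning would interact with exactly this kind of environment, observing the local component state while the call stack evolves implicitly.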