Recently discovered polyhedral structures of the value function for finite state-action discounted Markov decision processes (MDPs) shed light on understanding the success of reinforcement learning. We investigate the value function polytope in greater detail and characterize the polytope boundary using a hyperplane arrangement. We further show that the value space is a union of finitely many cells of the same hyperplane arrangement and relate it to the polytope of the classical linear programming formulation for MDPs. Inspired by these geometric properties, we propose a new algorithm, \emph{Geometric Policy Iteration} (GPI), to solve discounted MDPs. GPI updates the policy of a single state by switching to an action that is mapped to the boundary of the value function polytope, followed by an immediate update of the value function. This new update rule aims at faster value improvement without compromising computational efficiency. Moreover, our algorithm allows asynchronous updates of state values, which makes it more flexible and advantageous than traditional policy iteration when the state set is large. We prove that the complexity of GPI matches the best-known bound $\bigO{\frac{|\actions|}{1 - \gamma}\log \frac{1}{1-\gamma}}$ of policy iteration and empirically demonstrate the strength of GPI on MDPs of various sizes.
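To make the single-state update rule concrete, the following is a minimal sketch, assuming a tabular MDP given by transition tensor \texttt{P} of shape $(|S|, |A|, |S|)$, reward matrix \texttt{R} of shape $(|S|, |A|)$, and discount $\gamma \in (0,1)$. It illustrates only the idea of switching one state's action and immediately recomputing the value function; the names (\texttt{policy\_value}, \texttt{geometric\_style\_sweep}) are hypothetical, and the exact value recomputation here is a naive stand-in for the paper's efficient boundary-based update.

\begin{verbatim}
import numpy as np

def policy_value(P, R, policy, gamma):
    # Solve V = r_pi + gamma * P_pi V exactly for a deterministic policy.
    n_states = P.shape[0]
    P_pi = P[np.arange(n_states), policy]   # (|S|, |S|)
    r_pi = R[np.arange(n_states), policy]   # (|S|,)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def geometric_style_sweep(P, R, policy, gamma):
    # One sweep: switch each state's action greedily, then update the
    # value function immediately so later states see the new values.
    # (Sketch only; not the paper's boundary-based update.)
    policy = policy.copy()
    V = policy_value(P, R, policy, gamma)
    for s in range(P.shape[0]):
        q_s = R[s] + gamma * P[s] @ V       # one-step lookahead for state s
        best_a = int(np.argmax(q_s))
        if q_s[best_a] > V[s] + 1e-12:      # an improving switch exists
            policy[s] = best_a
            V = policy_value(P, R, policy, gamma)  # immediate value update
    return policy, V
\end{verbatim}

In contrast to standard policy iteration, which changes actions at all states before re-evaluating the policy, this per-state switch followed by an immediate value update is what enables the asynchronous behavior described in the abstract.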