Recently discovered polyhedral structures of the value function for finite state-action discounted Markov decision processes (MDPs) shed light on the success of reinforcement learning. We investigate the value function polytope in greater detail and characterize the polytope boundary using a hyperplane arrangement. We further show that the value space is a union of finitely many cells of the same hyperplane arrangement and relate it to the polytope of the classical linear programming formulation for MDPs. Inspired by these geometric properties, we propose a new algorithm, Geometric Policy Iteration (GPI), to solve discounted MDPs. GPI updates the policy of a single state by switching to an action that is mapped to the boundary of the value function polytope, followed by an immediate update of the value function. This new update rule aims at faster value improvement without compromising computational efficiency. Moreover, our algorithm allows asynchronous updates of state values, which is more flexible and advantageous than traditional policy iteration when the state set is large. We prove that the complexity of GPI achieves the best known bound $\mathcal{O}\left(\frac{|\mathcal{A}|}{1 - \gamma}\log \frac{1}{1-\gamma}\right)$ of policy iteration and empirically demonstrate the strength of GPI on MDPs of various sizes.
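To make the single-state update rule concrete, the following is a minimal Python sketch of a GPI-style iteration on a tabular MDP, assuming a transition tensor P[s, a, s'], a reward matrix R[s, a], and a discount factor gamma. The function names (gpi_sketch, policy_value) and the greedy action choice used as a stand-in for the action mapped to the polytope boundary are illustrative assumptions, not the paper's exact construction; the sketch only shows the overall structure of a single-state policy switch followed by an immediate value update.

```python
import numpy as np

def policy_value(P, R, gamma, policy):
    """Value of a deterministic policy: solve (I - gamma * P_pi) V = R_pi."""
    n_states = P.shape[0]
    P_pi = P[np.arange(n_states), policy]   # (S, S') transitions under pi
    R_pi = R[np.arange(n_states), policy]   # (S,) rewards under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

def gpi_sketch(P, R, gamma, max_sweeps=100, tol=1e-8):
    """Sketch of a GPI-style loop: per-state policy switches with an
    immediate value update after each switch (assumed simplification)."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial policy
    V = policy_value(P, R, gamma, policy)
    for _ in range(max_sweeps):
        improved = False
        for s in range(n_states):           # single-state (asynchronous) updates
            q_s = R[s] + gamma * P[s] @ V   # action values at state s
            a_best = int(np.argmax(q_s))    # greedy proxy for the boundary action
            if q_s[a_best] > V[s] + tol:
                policy[s] = a_best
                V = policy_value(P, R, gamma, policy)  # immediate value update
                improved = True
        if not improved:
            break
    return policy, V
```

The key difference from standard policy iteration in this sketch is that the value function is recomputed right after each single-state switch rather than once per full sweep, mirroring the update rule described above.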