Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state-aggregated representations, where the state space is partitioned and either the policy or the value function approximation is held constant over partitions. It shows that a policy gradient method converges to a policy whose per-period regret is bounded by $\epsilon$, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as $\epsilon/(1-\gamma)$, where $\gamma$ is a discount factor. Faced with inherent approximation error, methods that locally optimize the true decision objective can be far more robust.
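To make the setting concrete, below is a minimal sketch of policy gradient with a state-aggregated softmax policy on a small tabular MDP. The MDP, the aggregation map `phi`, the step size, and the iteration count are illustrative assumptions rather than the paper's construction; the sketch only shows how states in a common partition are forced to share policy parameters while the gradient is still taken with respect to the true discounted objective.

```python
# Sketch: policy gradient with a state-aggregated softmax policy on a toy MDP.
# All problem data below (transitions, rewards, partition map) are assumptions
# made up for illustration; this is not the paper's exact algorithm.
import numpy as np

gamma = 0.9
n_states, n_actions, n_parts = 4, 2, 2

# Illustrative transition kernel P[s, a, s'] and reward table r[s, a].
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(size=(n_states, n_actions))

# State aggregation: states sharing a partition share policy parameters.
phi = np.array([0, 0, 1, 1])            # phi[s] = partition index of state s
theta = np.zeros((n_parts, n_actions))  # one softmax logit vector per partition

def policy(theta):
    logits = theta[phi]                  # (n_states, n_actions)
    ex = np.exp(logits - logits.max(axis=1, keepdims=True))
    return ex / ex.sum(axis=1, keepdims=True)

def q_values(pi):
    # Exact policy evaluation: V^pi = (I - gamma P_pi)^{-1} r_pi, Q^pi = r + gamma P V^pi.
    P_pi = np.einsum('sax,sa->sx', P, pi)
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return r + gamma * P @ V

def discounted_occupancy(pi, rho):
    # d_rho = (1 - gamma) * (I - gamma P_pi^T)^{-1} rho
    P_pi = np.einsum('sax,sa->sx', P, pi)
    return (1 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, rho)

rho = np.ones(n_states) / n_states      # assumed uniform initial-state distribution
for _ in range(2000):
    pi = policy(theta)
    Q = q_values(pi)
    d = discounted_occupancy(pi, rho)
    A = Q - (pi * Q).sum(axis=1, keepdims=True)        # advantages
    # Softmax policy gradient, accumulated over each partition's states.
    grad_states = (d[:, None] * pi * A) / (1 - gamma)  # per-state gradient of J
    grad = np.zeros_like(theta)
    np.add.at(grad, phi, grad_states)
    theta += 0.5 * grad

print(policy(theta).round(3))
```

Because the update locally ascends the true objective under the shared parameters, the limiting policy's per-period regret is governed by how much the state-action values can vary within a single partition, which is the role played by $\epsilon$ in the stated bound.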