Bisimulation metrics define a distance measure between states of a Markov decision process (MDP) based on a comparison of reward sequences. Because of this property, they provide theoretical guarantees for value function approximation. In this work we first prove that bisimulation metrics can be defined via any $p$-Wasserstein metric for $p \geq 1$. We then describe an approximate policy iteration (API) procedure that uses $\epsilon$-aggregation with $\pi$-bisimulation and prove performance bounds for continuous state spaces. We also bound the difference between $\pi$-bisimulation metrics in terms of the change in the policies themselves. Based on these theoretical results, we design an API($\alpha$) procedure that employs conservative policy updates and enjoys better performance bounds than the naive API approach. In addition, we propose a novel trust-region approach that avoids the need to explicitly solve a constrained optimization problem. Finally, we provide experimental evidence of improved stability compared to non-conservative alternatives in simulated continuous control.
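For concreteness, the $\pi$-bisimulation construction referenced above is usually stated as the fixed point of an operator on state metrics; the sketch below uses that common formulation with $W_p$ in place of the usual $W_1$, where $r^{\pi}$, $P^{\pi}$, and the operator name $\mathcal{F}^{\pi}_{p}$ are illustrative notation rather than necessarily the paper's own:
\[
(\mathcal{F}^{\pi}_{p} d)(s, s') \;=\; \bigl|\, r^{\pi}(s) - r^{\pi}(s') \,\bigr| \;+\; \gamma\, W_{p}(d)\bigl(P^{\pi}(\cdot \mid s),\, P^{\pi}(\cdot \mid s')\bigr),
\]
with $r^{\pi}(s) = \sum_{a} \pi(a \mid s)\, r(s, a)$ and $P^{\pi}(\cdot \mid s) = \sum_{a} \pi(a \mid s)\, P(\cdot \mid s, a)$, and where $W_{p}(d)$ denotes the $p$-Wasserstein distance under ground metric $d$. The $\pi$-bisimulation metric $d^{\pi}$ is the fixed point of $\mathcal{F}^{\pi}_{p}$; $p = 1$ recovers the classical definition, and $p \geq 1$ is the generalization claimed above. Similarly, a conservative API($\alpha$) update is typically a mixture of the current policy and a greedy improvement, e.g.
\[
\pi_{k+1} \;=\; (1 - \alpha)\, \pi_{k} + \alpha\, \pi^{+}_{k},
\]
where $\pi^{+}_{k}$ is the greedy policy computed on the aggregated states; this mixture form is the standard conservative-update template and is given here only as an assumption about what API($\alpha$) refers to.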