Bisimulation metrics define a distance measure between states of a Markov decision process (MDP) based on a comparison of reward sequences. Due to this property, they provide theoretical guarantees for value function approximation (VFA). In this work we first prove that bisimulation and $\pi$-bisimulation metrics can be defined via a more general class of Sinkhorn distances, which unifies various state similarity metrics used in recent work. Then we describe an approximate policy iteration (API) procedure that uses a bisimulation-based discretization of the state space for VFA and prove asymptotic performance bounds. Next, we bound the difference between $\pi$-bisimulation metrics in terms of the change in the policies themselves. Based on these results, we design an API($\alpha$) procedure that employs conservative policy updates and enjoys better performance bounds than the naive API approach. We discuss how such API procedures map onto practical actor-critic methods that use bisimulation metrics for state representation learning. Lastly, we validate our theoretical results and investigate their practical implications via a controlled empirical analysis based on an implementation of bisimulation-based API for finite MDPs.
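To make the Sinkhorn-based construction concrete, the following is a minimal sketch (not the paper's implementation) of computing a $\pi$-bisimulation metric on a finite MDP by fixed-point iteration, with the Wasserstein term replaced by an entropy-regularized Sinkhorn distance. The function names (`sinkhorn_distance`, `pi_bisimulation_metric`), the regularization parameter `eps`, and the stopping tolerance are illustrative assumptions, not quantities fixed by the abstract.

```python
import numpy as np

def sinkhorn_distance(mu, nu, cost, eps=0.01, n_iters=200):
    """Entropy-regularized OT (Sinkhorn) distance between distributions
    mu and nu over the same finite state set, under the given ground
    cost matrix.  Illustrative sketch only."""
    K = np.exp(-cost / eps)                 # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.T @ u + 1e-30)          # Sinkhorn scaling updates
        u = mu / (K @ v + 1e-30)
    plan = u[:, None] * K * v[None, :]      # entropic transport plan
    return float(np.sum(plan * cost))       # transport cost of the plan

def pi_bisimulation_metric(R, P, pi, gamma=0.9, eps=0.01, tol=1e-6):
    """Fixed-point iteration for a Sinkhorn-based pi-bisimulation metric
    on a finite MDP.  R: (S, A) rewards, P: (S, A, S) transitions,
    pi: (S, A) policy.  A sketch under assumed shapes, not the authors' code."""
    S = R.shape[0]
    r_pi = (pi * R).sum(axis=1)             # expected reward under pi
    P_pi = np.einsum('sa,sat->st', pi, P)   # state-to-state kernel under pi
    d = np.zeros((S, S))
    while True:
        d_new = np.empty_like(d)
        for s in range(S):
            for t in range(S):
                w = sinkhorn_distance(P_pi[s], P_pi[t], d, eps)
                d_new[s, t] = abs(r_pi[s] - r_pi[t]) + gamma * w
        if np.abs(d_new - d).max() < tol:
            return d_new
        d = d_new
```

As `eps` shrinks, the Sinkhorn term approaches the Wasserstein distance used in the standard $\pi$-bisimulation operator, which is the sense in which Sinkhorn distances generalize the usual definition; the resulting metric could then drive a discretization of the state space for the API procedure described above.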