Policy Optimization (PO) algorithms have proven particularly well suited to handling the high dimensionality of real-world continuous control tasks. In this context, Trust Region Policy Optimization methods are a popular approach to stabilizing policy updates. These methods usually rely on the Kullback-Leibler (KL) divergence to limit the change in the policy. The Wasserstein distance is a natural alternative to the KL divergence for defining trust regions or regularizing the objective function. However, state-of-the-art works either resort to approximations of it or do not provide an algorithm for continuous state-action spaces, reducing the applicability of the method. In this paper, we explore optimal transport discrepancies (which include the Wasserstein distance) to define trust regions, and we propose a novel algorithm - Optimal Transport Trust Region Policy Optimization (OT-TRPO) - for continuous state-action spaces. We circumvent the infinite-dimensional optimization problem for PO by providing a one-dimensional dual reformulation for which strong duality holds. We then analytically derive the optimal policy update given the solution of the dual problem. In this way, we bypass the computation of optimal transport costs and of optimal transport maps, which we implicitly characterize by solving the dual formulation. Finally, we provide an experimental evaluation of our approach across various control tasks. Our results show that optimal transport discrepancies can offer an advantage over state-of-the-art approaches.
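To make the trust-region construction and its one-dimensional dual concrete, the following is a schematic rendering in our own notation (an illustrative assumption, not necessarily the paper's exact formulation): let \(\pi_k\) be the current policy with advantage function \(A^{\pi_k}\), let \(c\) be a ground transport cost on the action space, and let \(\mathcal{W}_c\) denote the induced optimal transport discrepancy. The trust-region update at each state \(s\) can be written as
\[
\pi_{k+1} \in \arg\max_{\pi}\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[A^{\pi_k}(s,a)\big]
\quad \text{s.t.} \quad \mathcal{W}_c\big(\pi(\cdot \mid s),\, \pi_k(\cdot \mid s)\big) \le \varepsilon .
\]
Dualizing the constraint with a single multiplier \(\lambda \ge 0\) yields a one-dimensional problem of the form
\[
\inf_{\lambda \ge 0}\; \lambda \varepsilon
+ \mathbb{E}_{a \sim \pi_k(\cdot \mid s)}\Big[\sup_{a'}\big(A^{\pi_k}(s,a') - \lambda\, c(a,a')\big)\Big],
\]
so the infinite-dimensional search over policies is replaced by a scalar search over \(\lambda\). Given an optimal \(\lambda^\star\), the updated policy transports each action \(a\) toward a maximizer \(a'\) of the inner supremum, which is how the transport cost and transport map are characterized implicitly rather than computed explicitly.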