A novel optimization approach is proposed for application to policy gradient methods and evolution strategies for reinforcement learning (RL). The procedure uses a computationally efficient Wasserstein natural gradient (WNG) descent that takes advantage of the geometry induced by a Wasserstein penalty to speed optimization. This method follows the recent theme in RL of including a divergence penalty in the objective to establish a trust region. Experiments on challenging tasks demonstrate improvements in both computational cost and performance over advanced baselines.
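For concreteness, the penalized objective and the resulting update can be sketched as follows (a minimal illustration; the symbols $J$, $\lambda$, $\eta$, and $G_W$ are notational assumptions rather than the paper's exact formulation):

\[
\max_{\theta} \; J(\theta) - \lambda \, W_2^2\!\left(\pi_{\theta_k}, \pi_{\theta}\right),
\qquad
\theta_{k+1} = \theta_k + \eta \, G_W(\theta_k)^{-1} \nabla_{\theta} J(\theta_k),
\]

where $J(\theta)$ is the expected return of policy $\pi_{\theta}$, the squared 2-Wasserstein penalty $W_2^2$ keeps successive policies inside a trust region whose size is controlled by $\lambda$, and $G_W(\theta_k)$ is the Wasserstein information matrix whose inverse preconditions the gradient, yielding the Wasserstein natural gradient.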