In this paper, we present a novel Heavy-Tailed Stochastic Policy Gradient (HT-SPG) algorithm to deal with the challenges of sparse rewards in continuous control problems. Sparse rewards are common in continuous control robotics tasks such as manipulation and navigation, and they make the learning problem hard due to the non-trivial estimation of value functions over the state space. This typically demands either reward shaping or expert demonstrations for the sparse-reward environment; however, obtaining high-quality demonstrations is expensive and sometimes even impossible. We propose a heavy-tailed policy parametrization along with a modified momentum-based policy gradient tracking scheme (HT-SPG) to induce stable exploratory behavior in the algorithm. The proposed algorithm does not require access to expert demonstrations. We evaluate HT-SPG on various benchmark continuous control tasks with sparse rewards, such as 1D Mario, Pathological Mountain Car, Sparse Pendulum in OpenAI Gym, and Sparse MuJoCo environments (Hopper-v2). We show consistent performance improvement across all tasks in terms of high average cumulative reward. HT-SPG also demonstrates improved convergence speed with fewer samples, thereby emphasizing the sample efficiency of our proposed algorithm.
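To make the idea of a heavy-tailed policy parametrization concrete, the sketch below replaces the usual Gaussian policy head with a Cauchy distribution, whose heavier tails produce occasional large actions and hence broader exploration under sparse rewards. This is a minimal illustrative sketch only: the network architecture, the specific choice of the Cauchy distribution, and the REINFORCE-style surrogate loss are assumptions for exposition, not the paper's exact HT-SPG parametrization or its momentum-based gradient tracking update.

```python
# Minimal sketch of a heavy-tailed (Cauchy) policy head, assuming PyTorch.
# All sizes and hyperparameters here are illustrative, not the paper's values.
import torch
import torch.nn as nn
from torch.distributions import Cauchy

class HeavyTailedPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.loc = nn.Linear(hidden, act_dim)                 # location parameter of the Cauchy
        self.log_scale = nn.Parameter(torch.zeros(act_dim))   # state-independent scale (assumed)

    def dist(self, obs: torch.Tensor) -> Cauchy:
        h = self.body(obs)
        return Cauchy(self.loc(h), self.log_scale.exp())

    def act(self, obs: torch.Tensor):
        """Sample an action and its log-probability (summed over action dimensions)."""
        d = self.dist(obs)
        a = d.sample()
        return a, d.log_prob(a).sum(-1)

# Usage: a plain score-function policy-gradient step with the heavy-tailed policy.
if __name__ == "__main__":
    policy = HeavyTailedPolicy(obs_dim=3, act_dim=1)
    obs = torch.randn(8, 3)                  # a batch of observations (placeholder data)
    actions, logp = policy.act(obs)
    returns = torch.randn(8)                 # placeholder returns from rollouts
    loss = -(logp * returns).mean()          # REINFORCE-style surrogate loss
    loss.backward()
```

Swapping the Gaussian for a heavier-tailed distribution is the simplest way to encourage the large, rare exploratory actions that sparse-reward tasks need without relying on reward shaping or demonstrations; the full algorithm additionally stabilizes this exploration with a momentum-based gradient tracking scheme not shown here.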