Hierarchical Reinforcement Learning (HRL) algorithms have been demonstrated to perform well on high-dimensional decision-making and robotic control tasks. However, because they solely optimize for rewards, the agent tends to search the same space redundantly. This slows learning and lowers the reward the agent ultimately achieves. In this work, we present an off-policy HRL algorithm that maximizes entropy for efficient exploration. The algorithm learns a temporally abstracted low-level policy and is able to explore broadly through the addition of entropy to the high level. The novelty of this work is the theoretical motivation for adding entropy to the RL objective in the HRL setting. We empirically show that entropy can be added to both levels if the Kullback-Leibler (KL) divergence between consecutive updates of the low-level policy is sufficiently small. We performed an ablative study to analyze the effects of entropy on hierarchy, in which adding entropy to the high level emerged as the most desirable configuration. Furthermore, a higher temperature at the low level leads to Q-value overestimation and increases the stochasticity of the environment in which the high level operates, making learning more challenging. Our method, SHIRO, surpasses state-of-the-art performance on a range of simulated robotic control benchmark tasks and requires minimal tuning.
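For reference, "adding entropy to the RL objective" refers to the standard maximum-entropy formulation; the sketch below is the generic form (the paper's HRL-specific variant and notation may differ), where $\alpha$ is the temperature weighting the entropy bonus and, at the high level, $\pi$ would be replaced by the high-level (subgoal-setting) policy:
\[
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big].
\]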