Goal-conditioned reinforcement learning (GCRL) has a wide range of potential real-world applications, including manipulation and navigation problems in robotics. Especially in such robotics tasks, sample efficiency is of the utmost importance for GCRL since, by default, the agent is only rewarded when it reaches its goal. While several methods have been proposed to improve the sample efficiency of GCRL, one relatively under-studied approach is the design of neural architectures to support sample efficiency. In this work, we introduce a novel neural architecture for GCRL that achieves significantly better sample efficiency than the commonly-used monolithic network architecture. The key insight is that the optimal action-value function Q^*(s, a, g) must satisfy the triangle inequality in a specific sense. Building on this insight, we introduce the metric residual network (MRN), which deliberately decomposes the action-value function Q(s, a, g) into the negated sum of a metric and a residual asymmetric component. MRN provably approximates any optimal action-value function Q^*(s, a, g), making it a fitting neural architecture for GCRL. We conduct comprehensive experiments across 12 standard benchmark environments in GCRL. The empirical results demonstrate that MRN uniformly outperforms other state-of-the-art GCRL neural architectures in terms of sample efficiency.
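To make the decomposition concrete, the following is a minimal sketch of a Q-network with the structure described above: the action-value is the negated sum of a symmetric metric term and an asymmetric residual term computed from learned embeddings of (state, action) and goal. This is not the authors' exact implementation; the specific choices here (a Euclidean distance for the metric term, a rectified max-over-coordinates difference for the asymmetric residual, and the layer sizes) are illustrative assumptions.

```python
# Sketch only: concrete component forms and layer sizes are assumptions.
import torch
import torch.nn as nn


class MetricResidualNetwork(nn.Module):
    """Q(s, a, g) = -( d_sym(s, a, g) + d_asym(s, a, g) )."""

    def __init__(self, state_dim: int, action_dim: int, goal_dim: int,
                 embed_dim: int = 64, hidden: int = 256):
        super().__init__()
        # Embed the (state, action) pair and the goal into a shared latent
        # space so the two terms can compare them directly.
        self.sa_encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * embed_dim),
        )
        self.g_encoder = nn.Sequential(
            nn.Linear(goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * embed_dim),
        )
        self.embed_dim = embed_dim

    def forward(self, state, action, goal):
        z_sa = self.sa_encoder(torch.cat([state, action], dim=-1))
        z_g = self.g_encoder(goal)
        # Split each embedding into a symmetric half and an asymmetric half.
        sym_sa, asym_sa = z_sa.split(self.embed_dim, dim=-1)
        sym_g, asym_g = z_g.split(self.embed_dim, dim=-1)
        # Symmetric part: a true metric (Euclidean distance) in latent space.
        d_sym = torch.norm(sym_sa - sym_g, dim=-1)
        # Asymmetric residual: a rectified max-over-coordinates difference,
        # which is zero when the embeddings coincide but is not symmetric
        # in its arguments.
        d_asym = torch.max(torch.relu(asym_sa - asym_g), dim=-1).values
        # Negated sum, matching the decomposition described in the abstract.
        return -(d_sym + d_asym)
```

The symmetric term alone satisfies the triangle inequality, while the residual term absorbs whatever asymmetry the optimal action-value function requires; their negated sum is what the network outputs as Q(s, a, g).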