Goal-conditioned reinforcement learning (GCRL) has a wide range of potential real-world applications, including manipulation and navigation problems in robotics. Especially in such robotics tasks, sample efficiency is of the utmost importance for GCRL since, by default, the agent is only rewarded when it reaches its goal. While several methods have been proposed to improve the sample efficiency of GCRL, one relatively under-studied approach is the design of neural architectures to support sample efficiency. In this work, we introduce a novel neural architecture for GCRL that achieves significantly better sample efficiency than the commonly-used monolithic network architecture. The key insight is that the optimal action-value function Q^*(s, a, g) must satisfy the triangle inequality in a specific sense. Furthermore, we introduce the metric residual network (MRN) that deliberately decomposes the action-value function Q(s,a,g) into the negated summation of a metric plus a residual asymmetric component. MRN provably approximates any optimal action-value function Q^*(s,a,g), thus making it a fitting neural architecture for GCRL. We conduct comprehensive experiments across 12 standard benchmark environments in GCRL. The empirical results demonstrate that MRN uniformly outperforms other state-of-the-art GCRL neural architectures in terms of sample efficiency.
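To make the stated decomposition concrete, below is a minimal sketch of a network in the spirit of MRN. It is not the authors' reference implementation: the encoder sizes, the choice of a Euclidean metric for the symmetric part, and the max-over-ReLU form of the asymmetric residual are illustrative assumptions; only the overall structure, Q(s, a, g) as the negated sum of a metric term and an asymmetric residual term, comes from the abstract.

```python
import torch
import torch.nn as nn

class MetricResidualNetwork(nn.Module):
    """Illustrative sketch of the MRN decomposition described above:
    Q(s, a, g) = -( d_sym(s, a, g) + d_asym(s, a, g) ),
    where d_sym is a symmetric metric between learned embeddings and
    d_asym is an asymmetric residual component (details assumed)."""

    def __init__(self, state_dim, action_dim, goal_dim, embed_dim=64, hidden=256):
        super().__init__()
        # Encoder for the state-action pair (architecture details are assumptions).
        self.sa_encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * embed_dim),
        )
        # Encoder for the goal.
        self.g_encoder = nn.Sequential(
            nn.Linear(goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * embed_dim),
        )

    def forward(self, state, action, goal):
        sa = self.sa_encoder(torch.cat([state, action], dim=-1))
        g = self.g_encoder(goal)
        sa_sym, sa_asym = sa.chunk(2, dim=-1)
        g_sym, g_asym = g.chunk(2, dim=-1)
        # Symmetric component: a Euclidean metric between embeddings.
        d_sym = torch.norm(sa_sym - g_sym, dim=-1)
        # Asymmetric residual: max over one-sided coordinate gaps, which is
        # non-negative and satisfies the triangle inequality but is not
        # symmetric (one plausible quasimetric-style choice).
        d_asym = torch.relu(sa_asym - g_asym).max(dim=-1).values
        # Negated sum, matching the decomposition stated in the abstract.
        return -(d_sym + d_asym)

# Usage with made-up dimensions: batch of 32 state-action-goal triples.
q_net = MetricResidualNetwork(state_dim=10, action_dim=4, goal_dim=3)
q_value = q_net(torch.randn(32, 10), torch.randn(32, 4), torch.randn(32, 3))
```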