The dominant framework for off-policy multi-goal reinforcement learning involves estimating a goal-conditioned Q-value function. When learning to achieve multiple goals, data efficiency is intimately connected with the generalization of the Q-function to new goals. The de facto paradigm is to approximate Q(s, a, g) with a monolithic neural network. To improve the generalization of the Q-function, we propose a bilinear decomposition that represents the Q-value via a low-rank approximation in the form of a dot product between two vector fields. The first vector field, f(s, a), captures the environment's local dynamics at the state s, whereas the second component, φ(s, g), captures the global relationship between the current state and the goal. We show that our bilinear decomposition scheme substantially improves data efficiency and achieves superior transfer to out-of-distribution goals compared to prior methods. Empirical evidence is provided on the simulated Fetch robot task suite and on dexterous manipulation with a Shadow hand.
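A minimal sketch of such a bilinear Q-network, assuming a PyTorch implementation with two small MLPs whose outputs are combined by a dot product; the class name BilinearQNetwork, hidden sizes, embedding dimension, and layer counts are illustrative assumptions, not values taken from the paper:

```python
import torch
import torch.nn as nn

class BilinearQNetwork(nn.Module):
    """Q(s, a, g) ≈ f(s, a) · φ(s, g): a low-rank bilinear decomposition.

    Hidden sizes, embedding dimension, and layer counts are illustrative
    assumptions, not values from the paper.
    """

    def __init__(self, state_dim, action_dim, goal_dim, embed_dim=64, hidden=256):
        super().__init__()
        # f(s, a): branch conditioned on state and action (local dynamics).
        self.f = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )
        # φ(s, g): branch conditioned on state and goal (global state-goal relation).
        self.phi = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, state, action, goal):
        f_sa = self.f(torch.cat([state, action], dim=-1))    # (batch, embed_dim)
        phi_sg = self.phi(torch.cat([state, goal], dim=-1))  # (batch, embed_dim)
        # Dot product over the embedding dimension yields the scalar Q-value.
        return (f_sa * phi_sg).sum(dim=-1)                   # (batch,)

# Example usage with hypothetical dimensions:
# q_net = BilinearQNetwork(state_dim=10, action_dim=4, goal_dim=3)
# q_values = q_net(states, actions, goals)  # shape (batch,)
```

The low-rank structure comes from the fixed embedding dimension: the Q-function is constrained to be a dot product of the two learned vector fields rather than an unconstrained function of the concatenated (s, a, g) input.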