Many applications of reinforcement learning can be formalized as goal-conditioned environments, where, in each episode, there is a "goal" that affects the rewards obtained during that episode but does not affect the dynamics. Various techniques have been proposed to improve performance in goal-conditioned environments, such as automatic curriculum generation and goal relabeling. In this work, we explore a connection between off-policy reinforcement learning in goal-conditioned settings and knowledge distillation. In particular, the current Q-value function and the target Q-value estimate are both functions of the goal, and we would like to train the Q-value function to match its target for all goals. We therefore apply Gradient-Based Attention Transfer (Zagoruyko and Komodakis 2017), a knowledge distillation technique, to the Q-function update. We empirically show that this can improve the performance of goal-conditioned off-policy reinforcement learning when the space of goals is high-dimensional. We also show that this technique can be adapted to allow for efficient learning in the case of multiple simultaneous sparse goals, where the agent can attain a reward by achieving any one of a large set of objectives, all specified at test time. Finally, to provide theoretical support, we give examples of classes of environments where (under some assumptions) standard off-policy algorithms require at least O(d^2) observed transitions to learn an optimal policy, while our proposed technique requires only O(d) transitions, where d is the dimensionality of the goal and state space.
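To make the core idea concrete, the following is a minimal sketch in notation of our own choosing (the parameterization, the frozen target parameters, and the weighting coefficient are illustrative assumptions, not the paper's exact formulation) of how a gradient-based attention-transfer term could augment the standard goal-conditioned TD objective: alongside regressing each Q-value onto its bootstrapped target, the gradient of the Q-value with respect to the goal is regressed onto the corresponding gradient of the target.

\[
\mathcal{L}(\theta) = \mathbb{E}_{(s,a,s',g)}\Big[\big(Q_\theta(s,a,g) - y(s',g)\big)^2 + \lambda\,\big\|\nabla_g Q_\theta(s,a,g) - \nabla_g\, y(s',g)\big\|_2^2\Big],
\qquad
y(s',g) = r(s,a,g) + \gamma \max_{a'} Q_{\bar\theta}(s',a',g),
\]

where the target parameters are a frozen copy of the current parameters and the coefficient weights the distillation term. Intuitively, matching gradients with respect to the goal supplies d scalar constraints per observed transition instead of one, which is one way to motivate the O(d) versus O(d^2) sample-complexity separation described above.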