Goal-Conditioned Q-Learning as Knowledge Distillation

Many applications of reinforcement learning can be formalized as goal-conditioned environments, where, in each episode, there is a "goal" that affects the rewards obtained during that episode but does not affect the dynamics. Various techniques have been proposed to improve performance in goal-conditioned environments, such as automatic curriculum generation and goal relabeling. In this work, we explore a connection between off-policy reinforcement learning in goal-conditioned settings and knowledge distillation. In particular: the current Q-value function and the target Q-value estimate are both functions of the goal, and we would like to train the Q-value function to match its target for all goals. We therefore apply Gradient-Based Attention Transfer (Zagoruyko and Komodakis 2017), a knowledge distillation technique, to the Q-function update. We empirically show that this can improve the performance of goal-conditioned off-policy reinforcement learning when the space of goals is high-dimensional. We also show that this technique can be adapted to allow for efficient learning in the case of multiple simultaneous sparse goals, where the agent can attain a reward by achieving any one of a large set of objectives, all specified at test time. Finally, to provide theoretical support, we give examples of classes of environments where (under some assumptions) standard off-policy algorithms such as DDPG require at least O(d^2) replay buffer transitions to learn an optimal policy, while our proposed technique requires only O(d) transitions, where d is the dimensionality of the goal and state space. Code is available at https://github.com/alevine0/ReenGAGE.
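The core idea above, training the Q-function to match its target as a function of the goal, can be illustrated with a loss that penalizes both the value mismatch and the mismatch between gradients with respect to the goal. The following is a minimal sketch under stated assumptions, not the authors' implementation: the reward is assumed to be differentiable in the goal, the Q-function is a toy linear model, finite differences stand in for automatic differentiation, and every name (`grad_wrt_goal`, `gradient_matching_loss`, `target_fn`, `make_q`, `alpha`) is hypothetical.

```python
def grad_wrt_goal(f, goal, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at `goal`."""
    grad = []
    for i in range(len(goal)):
        up = list(goal)
        down = list(goal)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def gradient_matching_loss(q_fn, target_fn, goal, alpha=1.0):
    """Squared value error plus alpha times the squared goal-gradient error."""
    value_err = (q_fn(goal) - target_fn(goal)) ** 2
    grad_q = grad_wrt_goal(q_fn, goal)
    grad_t = grad_wrt_goal(target_fn, goal)
    grad_err = sum((a - b) ** 2 for a, b in zip(grad_q, grad_t))
    return value_err + alpha * grad_err


# Toy setup (assumed for illustration): one transition with a fixed next state.
next_state = [0.5, -1.0]
gamma = 0.9


def target_fn(goal):
    # Assumed dense reward: negative squared distance from next state to goal.
    reward = -sum((s - g) ** 2 for s, g in zip(next_state, goal))
    bootstrap = 0.0  # stand-in for Q_target(s', pi(s', g), g)
    return reward + gamma * bootstrap


def make_q(weights, bias):
    # Toy "student" Q-function, linear in the goal for a fixed (state, action).
    return lambda goal: sum(w * g for w, g in zip(weights, goal)) + bias


goal = [0.0, 0.0]
q_matched = make_q([1.0, -2.0], -1.25)  # matches target value and gradient at goal
q_flat = make_q([0.0, 0.0], -1.25)      # matches the value only

loss_matched = gradient_matching_loss(q_matched, target_fn, goal)
loss_flat = gradient_matching_loss(q_flat, target_fn, goal)
```

Both toy Q-functions agree with the target value at the sampled goal, so a plain squared-error loss cannot tell them apart. The gradient term is what separates them, which gives some intuition for why a single transition can carry more information when the goal is high-dimensional.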
