Most deep reinforcement learning algorithms are data inefficient in complex and rich environments, limiting their applicability to many scenarios. One direction for improving data efficiency is multitask learning with shared neural network parameters, where efficiency may be improved through transfer across related tasks. In practice, however, this is not usually observed, because gradients from different tasks can interfere negatively, making learning unstable and sometimes even less data efficient. Another issue is the different reward schemes between tasks, which can easily lead to one task dominating the learning of a shared model. We propose a new approach for joint training of multiple tasks, which we refer to as Distral (Distill & transfer learning). Instead of sharing parameters between the different workers, we propose to share a "distilled" policy that captures common behaviour across tasks. Each worker is trained to solve its own task while constrained to stay close to the shared policy, while the shared policy is trained by distillation to be the centroid of all task policies. Both aspects of the learning process are derived by optimizing a joint objective function. We show that our approach supports efficient transfer on complex 3D environments, outperforming several related methods. Moreover, the proposed learning process is more robust and more stable---attributes that are critical in deep reinforcement learning.
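To make the "joint objective" concrete, a minimal sketch of what such an objective could look like is given below; the coefficients $c_{\mathrm{KL}}$, $c_{\mathrm{Ent}}$ and the exact form are illustrative assumptions, not something stated in this abstract. Each task $i$ has its own policy $\pi_i$, and all tasks share a distilled policy $\pi_0$:
\[
\mathcal{J}\big(\pi_0, \{\pi_i\}\big) \;=\; \sum_i \mathbb{E}_{\pi_i}\!\left[\, \sum_{t \ge 0} \gamma^t \Big( r_i(a_t, s_t) \;-\; c_{\mathrm{KL}} \log \frac{\pi_i(a_t \mid s_t)}{\pi_0(a_t \mid s_t)} \;-\; c_{\mathrm{Ent}} \log \pi_i(a_t \mid s_t) \Big) \right].
\]
Under a formulation of this kind, the $c_{\mathrm{KL}}$ term (a discounted KL divergence in expectation) penalizes each task policy $\pi_i$ for drifting away from the shared policy $\pi_0$, the $c_{\mathrm{Ent}}$ term acts as an entropy bonus that discourages premature determinism, and maximizing the objective with respect to $\pi_0$ alone reduces to a distillation step that pulls $\pi_0$ toward the centroid of the task policies.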