Deep reinforcement learning (RL) is a promising approach to solving complex robotics problems. However, the process of learning through trial-and-error interactions is often highly time-consuming, despite recent advances in RL algorithms. Additionally, the success of RL critically depends on how well the reward-shaping function suits the task, which is itself time-consuming to design. As agents trained on a variety of robotics problems continue to proliferate, the ability to reuse their valuable learning for new domains becomes increasingly significant. In this paper, we propose a post-hoc technique for policy fusion using Optimal Transport theory as a robust means of consolidating the knowledge of multiple agents that have been trained on distinct scenarios. We further demonstrate that this provides an improved weight initialisation for the neural network policy when learning new tasks, requiring less time and fewer computational resources than either retraining the parent policies or training a new policy from scratch. Ultimately, our results on diverse agents commonly used in deep RL show that specialised knowledge can be unified into a "Renaissance agent", allowing for quicker learning of new skills.