Collaborative Training of Heterogeneous Reinforcement Learning Agents in Environments with Sparse Rewards: What and When to Share?

In the early stages of human life, babies develop their skills by exploring different scenarios, motivated by their inherent satisfaction rather than by extrinsic rewards from the environment. This behavior, referred to as intrinsic motivation, has emerged as one solution to the exploration challenge posed by reinforcement learning environments with sparse rewards. Diverse exploration approaches have been proposed to accelerate learning in single- and multi-agent problems with homogeneous agents. However, few studies have examined collaborative learning frameworks for heterogeneous agents that are deployed into the same environment but interact with different instances of it, without any prior knowledge. Beyond the heterogeneity itself, each agent's characteristics grant it access to only a subset of the full state space, which may hide different exploration strategies and optimal solutions. In this work we combine ideas from intrinsic motivation and transfer learning. Specifically, we focus on sharing parameters in actor-critic model architectures and on combining information obtained through intrinsic motivation, with the aim of achieving more efficient exploration and faster learning. We test our strategies in experiments on a modified version of ViZDoom's My Way Home scenario, which is more challenging than the original and allows us to evaluate the heterogeneity between agents. Our results reveal different ways in which a collaborative framework with little additional computational cost can outperform an independent learning process without knowledge sharing. Additionally, we show the need to correctly balance the relative importance of the extrinsic and intrinsic rewards to avoid undesired agent behaviors.
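To make the last point concrete, the sketch below shows one common way of mixing the two reward signals: a weighted sum with a scaling factor. It is an illustration under stated assumptions, not the method used in this work: the function `mixed_return`, the weight `beta`, and the toy reward sequences are all hypothetical.

```python
from typing import Sequence


def mixed_return(extrinsic: Sequence[float],
                 intrinsic: Sequence[float],
                 beta: float) -> float:
    """Undiscounted episode return of r_ext + beta * r_int.

    `beta` is a hypothetical weight that scales the intrinsic bonus
    relative to the extrinsic (environment) reward.
    """
    return sum(r_e + beta * r_i for r_e, r_i in zip(extrinsic, intrinsic))


# Toy sparse-reward episode (assumed values): the environment pays 1.0
# only at the final step, while a curiosity bonus of 0.5 arrives every step.
extrinsic = [0.0] * 99 + [1.0]
intrinsic = [0.5] * 100

for beta in (0.0, 0.01, 1.0):
    print(f"beta={beta}: return={mixed_return(extrinsic, intrinsic, beta):.2f}")
```

In this toy episode a small `beta` keeps the goal reward as the dominant term, whereas `beta = 1.0` makes the accumulated bonus far larger than the goal reward, so an agent could be paid more for continued exploration than for reaching the goal.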
