Transfer Learning has shown great potential to enhance single-agent Reinforcement Learning (RL) efficiency. Similarly, Multiagent RL (MARL) can also be accelerated if agents share knowledge with each other. However, how an agent should learn from other agents remains an open question. In this paper, we propose a novel Multiagent Policy Transfer Framework (MAPTF) to improve MARL efficiency. MAPTF learns which agent's policy is the best for each agent to reuse, and when to terminate it, by modeling multiagent policy transfer as an option learning problem. Furthermore, in practice the option module can only collect each agent's local experiences for updates due to the partial observability of the environment. In this setting, agents' experiences may be inconsistent with one another, which can cause inaccurate and oscillating option-value estimates. We therefore propose a novel option learning algorithm, successor representation option learning, which addresses this issue by decoupling the environment dynamics from rewards and learning the option-value under each agent's preference. MAPTF can be easily combined with existing deep RL and MARL approaches, and experimental results show that it significantly boosts the performance of existing methods in both discrete and continuous state spaces.
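The abstract's key mechanism, decoupling environment dynamics from rewards via the successor representation (SR), can be illustrated with a minimal tabular sketch. The 4-state chain MDP, the fixed behavior policy, and all names below are illustrative assumptions, not details from the paper: the SR matrix M captures expected discounted state occupancies, so values under any reward vector follow by a single dot product.

```python
import numpy as np

# Hypothetical toy setup: a 4-state chain where a fixed policy moves right
# with probability 0.9; state 3 is absorbing/terminal with reward 1.
n_states, gamma, alpha = 4, 0.9, 0.1
M = np.eye(n_states)  # SR matrix: M[s, s'] ~ expected discounted visits to s' from s

rng = np.random.default_rng(0)
s = 0
for _ in range(20000):
    s_next = min(s + 1, n_states - 1) if rng.random() < 0.9 else s
    # TD update for the SR: M(s,:) <- M(s,:) + alpha * (1_s + gamma*M(s',:) - M(s,:))
    onehot = np.eye(n_states)[s]
    M[s] += alpha * (onehot + gamma * M[s_next] - M[s])
    s = 0 if s_next == n_states - 1 else s_next  # reset episode at terminal state

# Dynamics live entirely in M; rewards are factored out. Values under ANY
# reward vector r are then V(s) = sum_s' M(s, s') * r(s'):
r = np.array([0.0, 0.0, 0.0, 1.0])  # reward only in the last state
V = M @ r
print(V)  # values increase toward the rewarding state
```

This separation is what makes the SR attractive in the multiagent transfer setting the abstract describes: if each agent has its own reward preference, the shared dynamics term M need not be relearned per agent; only the reward weighting changes.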