Although transfer learning promises to improve learning efficiency, existing methods still struggle with long-horizon tasks, especially when the expert policies are sub-optimal and only partially useful. Hence, a novel algorithm named EASpace (Enhanced Action Space) is proposed in this paper to transfer the knowledge of multiple sub-optimal expert policies. EASpace formulates each expert policy as multiple macro actions with different execution durations and integrates all macro actions directly into the primitive action space. Through this formulation, EASpace learns when to execute which expert policy and for how long. An intra-macro-action learning rule is proposed that adjusts the temporal-difference target of macro actions to improve data efficiency and alleviate the non-stationarity issue in multi-agent settings. Furthermore, an additional reward proportional to the execution time of macro actions is introduced to encourage environment exploration via macro actions, which is essential for learning long-horizon tasks. Theoretical analysis is presented to establish the convergence of the proposed algorithm. Its efficiency is illustrated in a grid-based game and a multi-agent pursuit problem, and the algorithm is also deployed on real physical systems to validate its effectiveness.
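To make the mechanism concrete, the following is a minimal Python sketch, under stated assumptions, of the three ingredients the abstract describes: an action space enhanced with (expert, duration) macro actions, a value update applied at every intermediate transition of a macro action in the spirit of intra-option learning, and a per-step bonus whose accumulated value is proportional to the macro's execution time. The toy chain environment, the hand-coded experts, and all names (`DURATIONS`, `ETA`, `step`, etc.) are illustrative assumptions; this generic intra-option-style update is not the paper's exact TD-target adjustment.

```python
import random
from collections import defaultdict

# Illustrative toy setting: a 1-D chain with reward only at the right end.
N_STATES = 10
PRIMITIVE = [0, 1]                        # move left / move right
DURATIONS = [2, 4]                        # candidate macro-action lengths
# Two hand-coded "sub-optimal experts": always-left and always-right.
EXPERTS = [lambda s: 0, lambda s: 1]

# Enhanced action space: primitives plus one macro per (expert, duration) pair.
MACROS = [(e, t) for e in range(len(EXPERTS)) for t in DURATIONS]
ACTIONS = [('prim', a) for a in PRIMITIVE] + [('macro', m) for m in MACROS]

ALPHA, GAMMA, EPS, ETA = 0.1, 0.95, 0.1, 0.01   # ETA: per-step macro bonus

Q = defaultdict(float)                    # Q-values over the enhanced space

def step(s, a):
    """Toy dynamics: reward 1 only upon reaching the rightmost state."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

def greedy(s):
    return max(ACTIONS, key=lambda act: Q[(s, act)])

def choose(s):
    return random.choice(ACTIONS) if random.random() < EPS else greedy(s)

for episode in range(300):
    s, done = 0, False
    while not done:
        kind, payload = choose(s)
        if kind == 'prim':                # ordinary one-step Q-learning
            s2, r, done = step(s, payload)
            target = r + (0.0 if done else GAMMA * Q[(s2, greedy(s2))])
            key = (s, (kind, payload))
            Q[key] += ALPHA * (target - Q[key])
            s = s2
        else:                             # execute a macro action
            e, tau = payload
            for _ in range(tau):
                if done:
                    break
                a = EXPERTS[e](s)         # follow the chosen expert policy
                s2, r, done = step(s, a)
                # Intra-macro-action update: every intermediate transition
                # also refreshes the macro's value; the per-step bonus ETA
                # makes the total bonus proportional to execution time,
                # encouraging exploration via longer macros.
                target = (r + ETA) + (0.0 if done else GAMMA * Q[(s2, greedy(s2))])
                key = (s, (kind, payload))
                Q[key] += ALPHA * (target - Q[key])
                s = s2
```

In this sketch the agent can discover that the always-right macro with a long duration quickly reaches the rewarding state, illustrating how sub-optimal but partially useful experts can be exploited selectively rather than imitated wholesale.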