Many hierarchical reinforcement learning (HRL) algorithms use a set of independent skills as a basis for solving tasks at a higher level of reasoning. These algorithms do not consider the value of using skills that are cooperative rather than independent. This paper proposes the Cooperative Consecutive Policies (CCP) method, which enables consecutive agents to cooperatively solve long-time-horizon, multi-stage tasks. CCP works by modifying each agent's policy to maximise both the current agent's critic and the next agent's critic. Cooperatively maximising the critics allows each agent to take actions that are beneficial to its own task as well as to subsequent tasks. Evaluated in a multi-room maze domain and a peg-in-hole manipulation domain, the cooperative policies outperformed a set of naive policies, a single agent trained across the entire domain, and another sequential HRL algorithm.
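For concreteness, a minimal sketch of what the cooperative actor update could look like is given below. The abstract only states that each agent's policy maximises both its own and the next agent's critic; the weighted-sum combination, the weighting factor `beta`, and all function and module names (`actor_i`, `critic_i`, `critic_next`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def cooperative_actor_loss(actor_i, critic_i, critic_next, states, beta=0.5):
    """Hypothetical actor loss maximising the current and next agent's critics.

    A weighted sum of the two critic values is assumed here; the abstract
    does not specify how the critics are combined, and beta is an assumed
    hyperparameter trading off the current task against the subsequent one.
    """
    actions = actor_i(states)              # a ~ pi_i(s): this agent's actions
    q_current = critic_i(states, actions)  # value under this stage's task
    q_next = critic_next(states, actions)  # value under the subsequent stage
    # Gradient ascent on both critics == descent on the negated weighted sum.
    return -(q_current + beta * q_next).mean()
```

Under these assumptions, setting `beta = 0` would recover a standard independent actor-critic update, which makes explicit how the cooperative term couples consecutive agents.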