Multi-agent policy gradient methods have demonstrated success in games and robotics, but are often limited to problems with low-level action spaces. When agents instead take higher-level, temporally-extended actions (i.e., options), it becomes challenging to determine when and how to derive a centralized control policy and its gradient, and how to sample options for all agents without interrupting ongoing option executions. This is mostly because agents may choose and terminate their options \textit{asynchronously}. In this work, we propose a conditional reasoning approach to address this problem, and empirically validate its effectiveness on representative option-based multi-agent cooperative tasks.
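To make the asynchrony issue concrete, one possible way to formalize the conditional sampling idea is sketched below; the notation is our own illustration and is not taken from the paper's formulation. Let $s_t$ denote the joint state, $o_t = (o_t^1, \dots, o_t^n)$ the agents' options, and $\mathcal{T}_t \subseteq \{1, \dots, n\}$ the set of agents whose options terminate at time $t$. New options would then be sampled only for the terminated agents, conditioned on the options still being executed by the others:
\[
o_t^{\mathcal{T}_t} \sim \pi_\theta\big(\cdot \mid s_t,\, o_{t-1}^{-\mathcal{T}_t}\big),
\qquad
o_t^i = o_{t-1}^i \quad \text{for } i \notin \mathcal{T}_t,
\]
where $o_{t-1}^{-\mathcal{T}_t}$ collects the options of agents not in $\mathcal{T}_t$. Under this sketch, ongoing option executions are never interrupted, which is the asynchrony constraint described above.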