Many real-world scenarios involve a team of agents that must coordinate their policies to achieve a shared goal. Previous studies mainly focus on decentralized control to maximize a common reward and rarely consider coordination among control policies, which is critical in dynamic and complicated environments. In this work, we propose factorizing the joint team policy into a graph generator and a graph-based coordinated policy to enable coordinated behaviours among agents. The graph generator adopts an encoder-decoder framework that outputs directed acyclic graphs (DAGs) to capture the underlying dynamic decision structure. We also apply DAGness-constrained and DAG depth-constrained optimization in the graph generator to balance efficiency and performance. The graph-based coordinated policy exploits the generated decision structure. The graph generator and coordinated policy are trained simultaneously to maximize the discounted return. Empirical evaluations on Collaborative Gaussian Squeeze, Cooperative Navigation, and Google Research Football demonstrate the superiority of the proposed method.
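DAGness constraints of this kind are commonly enforced with the NOTEARS acyclicity function h(A) = tr(e^{A∘A}) − d, which equals zero exactly when the weighted adjacency matrix A over the d agents encodes a DAG. The sketch below shows such a penalty in PyTorch; the function name `dagness_penalty` and the choice of the NOTEARS form are illustrative assumptions, not necessarily the paper's exact formulation, and the DAG depth constraint is omitted.

```python
import torch

def dagness_penalty(adj: torch.Tensor) -> torch.Tensor:
    """Acyclicity penalty h(A) = tr(exp(A * A)) - d (NOTEARS form).

    `adj` is a (d, d) weighted adjacency matrix over the d agents.
    The penalty is zero iff the matrix encodes a DAG, so it can be
    added to the training loss as a differentiable soft constraint.
    """
    d = adj.shape[-1]
    return torch.trace(torch.matrix_exp(adj * adj)) - d


# Example: a 2-cycle between two agents yields a positive penalty,
# while a strictly lower-triangular (acyclic) matrix yields zero.
cyclic = torch.tensor([[0.0, 1.0], [1.0, 0.0]])
acyclic = torch.tensor([[0.0, 0.0], [1.0, 0.0]])
print(dagness_penalty(cyclic).item())   # > 0
print(dagness_penalty(acyclic).item())  # == 0
```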