Adequate strategizing of agents' behaviors is essential to solving cooperative MARL problems. One intuitively beneficial yet uncommon approach in this domain is to predict agents' future behaviors and plan accordingly. Leveraging this insight, we propose a two-level hierarchical architecture that combines a novel information-theoretic objective with a trajectory prediction model to learn a strategy. To this end, we introduce a latent policy that learns two types of latent strategies: individual $z_A$ and relational $z_R$, using a modified Graph Attention Network module to extract interaction features. We encourage each agent to behave according to its strategy by conditioning its local $Q$ function on $z_A$, and we further equip the agents with a shared $Q$ function conditioned on $z_R$. Additionally, we introduce two regularizers that encourage the predicted trajectories to be both accurate and rewarding. Empirical results on Google Research Football (GRF) and StarCraft II (SC II) micromanagement tasks show that our method establishes a new state of the art: to the best of our knowledge, it is the first MARL algorithm to solve all super hard SC II scenarios as well as the GRF full game with a win rate higher than $95\%$, thus outperforming all existing methods. Videos and a brief overview of the methods and results are available at: https://sites.google.com/view/hier-strats-marl/home.
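To make the conditioning described above concrete, the following is a minimal sketch (not the authors' implementation) of how a per-agent local $Q$ network could take the individual latent $z_A$ as an extra input, while a shared $Q$ network takes the relational latent $z_R$; all names, dimensions, and the concatenation-based conditioning are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LocalQNet(nn.Module):
    """Per-agent Q function conditioned on an individual latent strategy z_A (sketch)."""

    def __init__(self, obs_dim: int, z_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + z_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor, z_a: torch.Tensor) -> torch.Tensor:
        # Concatenate the local observation with the sampled individual latent.
        return self.net(torch.cat([obs, z_a], dim=-1))


class SharedQNet(nn.Module):
    """Shared Q function conditioned on the relational latent strategy z_R (sketch)."""

    def __init__(self, state_dim: int, z_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + z_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor, z_r: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, z_r], dim=-1))


# Toy usage: one agent, batch of 8 transitions (all shapes are assumptions).
obs = torch.randn(8, 32)    # local observations
state = torch.randn(8, 64)  # global state (assumed input of the shared Q)
z_a = torch.randn(8, 16)    # individual latent from the latent policy
z_r = torch.randn(8, 16)    # relational latent from the GAT-based module
q_local = LocalQNet(32, 16, 5)(obs, z_a)      # shape: (8, 5)
q_shared = SharedQNet(64, 16, 5)(state, z_r)  # shape: (8, 5)
```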