Each year, expert-level performance is attained in increasingly complex multiagent domains, with notable examples including Go, Poker, and StarCraft II. This rapid progression is accompanied by a commensurate need to better understand how such agents attain this performance, in order to enable their safe deployment, identify limitations, and reveal potential means of improving them. In this paper we take a step back from performance-focused multiagent learning and instead turn our attention to agent behavior analysis. We introduce a model-agnostic method for discovering behavior clusters in multiagent domains, using variational inference to learn a hierarchy of behaviors at the joint and local agent levels. Our framework makes no assumptions about agents' underlying learning algorithms, does not require access to their latent states or policies, and is trained using only offline observational data. We illustrate the effectiveness of our method for enabling a coupled understanding of behaviors at the joint and local agent levels, detecting behavior changepoints throughout training, and discovering core behavioral concepts; we further demonstrate the approach's scalability to a high-dimensional multiagent MuJoCo control domain, and show that it can disentangle previously-trained policies in OpenAI's hide-and-seek domain.