Each year, expert-level performance is attained in increasingly complex multiagent domains; notable examples include Go, Poker, and StarCraft II. This rapid progression is accompanied by a commensurate need to better understand how such agents attain this performance, in order to enable their safe deployment, identify their limitations, and reveal potential means of improving them. In this paper we take a step back from performance-focused multiagent learning and instead turn our attention towards agent behavior analysis. We introduce a model-agnostic method for discovering behavior clusters in multiagent domains, using variational inference to learn a hierarchy of behaviors at the joint and local agent levels. Our framework makes no assumptions about agents' underlying learning algorithms, does not require access to their latent states or policies, and is trained using only offline observational data. We illustrate the effectiveness of our method for enabling the coupled understanding of behaviors at the joint and local agent levels, detecting behavior changepoints throughout training, and discovering core behavioral concepts. We also demonstrate the approach's scalability to a high-dimensional multiagent MuJoCo control domain and show that it can disentangle previously trained policies in OpenAI's hide-and-seek domain.