Self-play is a common paradigm for constructing solutions in Markov games that can yield optimal policies in collaborative settings. However, these policies often adopt highly specialized conventions that make playing with a novel partner difficult. To address this, recent approaches rely on encoding symmetry and convention-awareness into policy training, but these require strong environmental assumptions and can complicate policy training. We therefore propose moving the learning of conventions to the belief space. Specifically, we propose a belief learning model that can maintain beliefs over rollouts of policies not seen at training time, and can thus decode and adapt to novel conventions at test time. We show how to leverage this model for both search and training of a best response over various pools of policies to greatly improve ad hoc teamplay. We also show how our setup promotes explainability and interpretability of nuanced agent conventions.
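As a concrete illustration of what a trajectory-conditioned belief model of this kind might look like, the sketch below reads a partner's action-observation history with a recurrent encoder and outputs a distribution over a discretized hidden state (e.g., a private hand). This is a minimal, assumption-laden sketch rather than the paper's architecture; the dimensions `OBS_DIM`, `HIDDEN_DIM`, and `NUM_HIDDEN_STATES`, and the helper `belief_loss`, are hypothetical placeholders.

```python
# Illustrative sketch (not the paper's implementation) of a belief model:
# an LSTM reads an encoded action-observation history and outputs, at each
# step, a distribution over hidden-state classes.
import torch
import torch.nn as nn

OBS_DIM = 128            # size of one encoded action-observation step (assumed)
HIDDEN_DIM = 256         # LSTM hidden size (assumed)
NUM_HIDDEN_STATES = 25   # number of discretized hidden-state classes (assumed)

class BeliefModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(OBS_DIM, HIDDEN_DIM, batch_first=True)
        self.head = nn.Linear(HIDDEN_DIM, NUM_HIDDEN_STATES)

    def forward(self, trajectory):
        # trajectory: (batch, time, OBS_DIM) encoded action-observation history
        out, _ = self.encoder(trajectory)
        # Log-belief over hidden-state classes at every step of the rollout.
        return torch.log_softmax(self.head(out), dim=-1)

def belief_loss(model, trajectory, true_hidden_state):
    # Supervised cross-entropy against the true hidden state on rollouts
    # generated by a pool of training policies; at test time the same model
    # is queried on rollouts of a novel partner to decode its conventions.
    log_probs = model(trajectory)  # (batch, time, NUM_HIDDEN_STATES)
    return nn.functional.nll_loss(
        log_probs.reshape(-1, NUM_HIDDEN_STATES),
        true_hidden_state.reshape(-1),
    )
```

Trained this way, the belief output can then be consumed by a search procedure or a best-response policy, matching the two uses described above.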