In multi-agent reinforcement learning, the problem of learning to act is particularly difficult because the policies of co-players may be heavily conditioned on information observable only to them. Humans, in contrast, readily form beliefs about the knowledge possessed by their peers and leverage those beliefs to inform decision-making. Such abilities underlie individual success in a wide range of Markov games, from bluffing in Poker to conditional cooperation in the Prisoner's Dilemma to convention-building in Bridge. Classical methods are usually not applicable to complex domains due to the intractability of hierarchical beliefs (i.e., beliefs about other agents' beliefs). We propose a scalable method that approximates these belief structures with recursive deep generative models, and uses the belief models to obtain representations useful for acting in complex tasks. Under common training paradigms, our agents trained with belief models outperform model-free baselines of equivalent representational capacity. We also show that agents with higher-order belief models outperform those with lower-order models.
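For concreteness, here is a minimal sketch of what one level of such a recursive belief hierarchy could look like, assuming (the abstract does not specify the architecture) that each belief model is a conditional VAE trained on co-player data logged during self-play; all names (`BeliefModel`, `obs_dim`, `target_dim`) and the choice of diagonal-Gaussian latents are illustrative, not the paper's actual design.

```python
import torch
import torch.nn as nn

class BeliefModel(nn.Module):
    """One level of the belief hierarchy, sketched as a conditional VAE.
    For a first-order model the target is the co-player's private
    observation; for order k > 1 it is the co-player's order-(k-1)
    belief representation (the recursive step)."""

    def __init__(self, obs_dim, target_dim, latent_dim=32, hidden=128):
        super().__init__()
        # Posterior q(z | obs, target): usable only at training time,
        # when the target is recoverable from logged self-play data.
        self.post = nn.Sequential(
            nn.Linear(obs_dim + target_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))
        # Prior p(z | obs): used at act time, when the target is hidden.
        self.prior = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))
        # Decoder p(target | obs, z).
        self.dec = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, target_dim))

    def loss(self, obs, target):
        """ELBO-style training loss: reconstruction + KL(q || p)."""
        mu_q, logvar_q = self.post(torch.cat([obs, target], -1)).chunk(2, -1)
        mu_p, logvar_p = self.prior(obs).chunk(2, -1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
        recon = self.dec(torch.cat([obs, z], -1))
        recon_loss = (recon - target).pow(2).sum(-1)
        var_q, var_p = logvar_q.exp(), logvar_p.exp()
        kl = 0.5 * (logvar_p - logvar_q
                    + (var_q + (mu_q - mu_p).pow(2)) / var_p - 1).sum(-1)
        return (recon_loss + kl).mean()

    def belief(self, obs):
        """Belief representation fed to the policy: the prior mean over z."""
        mu_p, _ = self.prior(obs).chunk(2, -1)
        return mu_p

# Recursion across orders: a second-order model predicts the co-player's
# first-order belief representation rather than raw private observations.
first_order = BeliefModel(obs_dim=16, target_dim=8)
second_order = BeliefModel(obs_dim=16, target_dim=32)  # target_dim = latent_dim
```

In this sketch the policy would consume `belief(obs)` alongside its own observation encoding, and a second-order model is trained with the co-player's first-order belief representation as its target, which is one plausible reading of the recursive structure the abstract describes.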