Zero-shot human-AI coordination holds the promise of collaborating with humans without requiring human data. Prevailing methods train the ego agent against a population of partners via self-play. However, these methods suffer from two problems: 1) a population with a finite number of partners offers limited diversity, which restricts the trained ego agent's ability to collaborate with a novel human; 2) current methods learn only a common best response to every partner in the population, which may result in poor zero-shot coordination performance with a novel partner or human. To address these issues, we first propose a policy ensemble method to increase the diversity of partners in the population, and then develop a context-aware method that enables the ego agent to analyze and identify a partner's potential policy primitives so that it can adapt its actions accordingly. In this way, the ego agent learns more universal cooperative behaviors for collaborating with diverse partners. We conduct experiments in the Overcooked environment and evaluate the zero-shot human-AI coordination performance of our method with both behavior-cloned human proxies and real humans. The results demonstrate that our method significantly increases partner diversity and enables the ego agent to learn more diverse behaviors than baselines, achieving state-of-the-art performance in all scenarios. We also open-source a human-AI coordination study framework on Overcooked to facilitate future research.
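To make the context-aware idea more concrete, the following is a minimal PyTorch sketch, not the paper's actual architecture: all module names, dimensions, and the GRU-based encoder are illustrative assumptions. It shows an ego policy that encodes the partner's recent observation-action history into a latent context vector and conditions its own action distribution on both its observation and that context.

```python
import torch
import torch.nn as nn

# Illustrative sketch only (assumed architecture, not the paper's method):
# a context-aware ego policy that summarizes the partner's recent behavior
# into a context vector and conditions its own actions on it.
class ContextAwareEgoPolicy(nn.Module):
    def __init__(self, obs_dim, partner_action_dim, action_dim,
                 context_dim=32, hidden_dim=128):
        super().__init__()
        # Recurrent encoder over the partner's recent (observation, action)
        # pairs; its final hidden state serves as the context vector that
        # summarizes the partner's apparent policy primitive.
        self.context_encoder = nn.GRU(
            input_size=obs_dim + partner_action_dim,
            hidden_size=context_dim,
            batch_first=True,
        )
        # Policy head conditioned on the ego observation and the context.
        self.policy_head = nn.Sequential(
            nn.Linear(obs_dim + context_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, ego_obs, partner_history):
        # partner_history: (batch, time, obs_dim + partner_action_dim)
        _, h_n = self.context_encoder(partner_history)
        context = h_n.squeeze(0)  # (batch, context_dim)
        logits = self.policy_head(torch.cat([ego_obs, context], dim=-1))
        return torch.distributions.Categorical(logits=logits)


# Toy usage: the ego agent infers the current partner's behavior from a short
# interaction history and samples a context-conditioned action.
if __name__ == "__main__":
    obs_dim, partner_action_dim, action_dim = 16, 6, 6
    ego = ContextAwareEgoPolicy(obs_dim, partner_action_dim, action_dim)
    ego_obs = torch.randn(1, obs_dim)
    partner_history = torch.randn(1, 10, obs_dim + partner_action_dim)
    action = ego(ego_obs, partner_history).sample()
    print(action.item())
```

Under this reading, the policy ensemble supplies diverse partners during training, and the context vector is what lets the ego agent respond differently to different members of that ensemble rather than playing a single common best response.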