An AI agent should be able to coordinate with humans to solve tasks. We consider the problem of training a Reinforcement Learning (RL) agent without using any human data, i.e., in a zero-shot setting, so that it can collaborate with humans. Standard RL agents learn through self-play. Unfortunately, such agents only know how to collaborate with themselves and usually do not perform well with unseen partners, such as humans. How to train a robust agent in a zero-shot fashion remains an open research problem. Motivated by maximum entropy RL, we derive a centralized population entropy objective that facilitates learning of a diverse population of agents, which is then used to train a robust agent that collaborates well with unseen partners. The proposed method outperforms baseline methods, including self-play PPO, standard Population-Based Training (PBT), and trajectory-diversity-based PBT, in the popular Overcooked game environment. We also conduct online experiments with real humans, further demonstrating the efficacy of the method in the real world. A supplementary video showing experimental results is available at https://youtu.be/Xh-FKD0AAKE.
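To make the objective concrete, one hedged reading (the abstract does not spell out the formula) is that the population entropy is the entropy of the population's mean policy, added as a bonus to the task reward; the weighting coefficient $\alpha$, population size $n$, and policies $\pi_i$ below are illustrative symbols rather than notation taken from the paper:

$$
\bar{\pi}(a \mid s) = \frac{1}{n}\sum_{i=1}^{n} \pi_i(a \mid s),
\qquad
J\big(\{\pi_i\}_{i=1}^{n}\big) = \mathbb{E}\Big[\sum_{t} r(s_t, a_t) + \alpha\, \mathcal{H}\big(\bar{\pi}(\cdot \mid s_t)\big)\Big].
$$

Under this reading, each agent in the population is rewarded both for task success and for visiting situations where the population as a whole is unpredictable, which encourages behavioral diversity across the population before the robust agent is trained against it.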