Collaborating with Humans without Human Data

Collaborating with humans requires rapidly adapting to their individual strengths, weaknesses, and preferences. Unfortunately, most standard multi-agent reinforcement learning techniques, such as self-play (SP) or population play (PP), produce agents that overfit to their training partners and do not generalize well to humans. Alternatively, researchers can collect human data, train a human model using behavioral cloning, and then use that model to train "human-aware" agents ("behavioral cloning play", or BCP). While such an approach can improve the generalization of agents to new human co-players, it involves the onerous and expensive step of collecting large amounts of human data first. Here, we study the problem of how to train agents that collaborate well with human partners without using human data. We argue that the crux of the problem is to produce a diverse set of training partners. Drawing inspiration from successful multi-agent approaches in competitive domains, we find that a surprisingly simple approach is highly effective. We train our agent partner as the best response to a population of self-play agents and their past checkpoints taken throughout training, a method we call Fictitious Co-Play (FCP). Our experiments focus on a two-player collaborative cooking simulator that has recently been proposed as a challenge problem for coordination with humans. We find that FCP agents score significantly higher than SP, PP, and BCP when paired with novel agent and human partners. Furthermore, humans also report a strong subjective preference for partnering with FCP agents over all baselines.
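
The two-stage structure of FCP described above (first build a frozen population from self-play agents and their past checkpoints, then train a single agent as a best response to partners sampled from that population) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy matching game stands in for the cooking simulator, the simple reinforcement update stands in for the deep RL training the paper would use, and the names `train_self_play`, `build_population`, and `train_fcp_agent` are hypothetical.

```python
import random

N_ACTIONS = 3  # toy matching game: both players are rewarded when actions agree


def normalize(weights):
    """Turn positive weights into a probability vector (returns a new list)."""
    total = sum(weights)
    return [w / total for w in weights]


def sample(probs, rng):
    """Draw one action index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def train_self_play(seed, steps=300, lr=0.5):
    """Stage 1 (toy): one policy plays against a copy of itself.

    Returns three frozen checkpoints: initial, midpoint, and final.
    Earlier checkpoints stand in for less skilled partners.
    """
    rng = random.Random(seed)
    weights = [1.0 + rng.random() for _ in range(N_ACTIONS)]
    checkpoints = [normalize(weights)]
    for step in range(1, steps + 1):
        probs = normalize(weights)
        a, b = sample(probs, rng), sample(probs, rng)
        if a == b:  # reinforce the action that coordinated
            weights[a] += lr
        if step == steps // 2 or step == steps:
            checkpoints.append(normalize(weights))
    return checkpoints


def build_population(n_agents=4):
    """Collect checkpoints from independently seeded self-play runs."""
    population = []
    for seed in range(n_agents):
        population.extend(train_self_play(seed))
    return population


def train_fcp_agent(population, episodes=5000, lr=0.1, seed=0):
    """Stage 2 (toy): train one agent against sampled frozen partners.

    The agent first observes one action from its partner, then chooses its
    own action; it is rewarded if that matches the partner's next action.
    Only the agent's table is updated; partners are never modified.
    """
    rng = random.Random(seed)
    table = [[1.0] * N_ACTIONS for _ in range(N_ACTIONS)]
    for _ in range(episodes):
        partner = rng.choice(population)
        observed = sample(partner, rng)
        action = sample(normalize(table[observed]), rng)
        if action == sample(partner, rng):
            table[observed][action] += lr
    return [normalize(row) for row in table]


if __name__ == "__main__":
    population = build_population()
    policy = train_fcp_agent(population)
    for observed, row in enumerate(policy):
        print(observed, [round(p, 2) for p in row])
```

The design point this sketch tries to capture is that the population is fixed before the second stage begins, and that it mixes partners with different conventions (different seeds) and different skill levels (different checkpoints), which is the source of partner diversity the abstract argues for.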
