There is a recent trend of applying multi-agent reinforcement learning (MARL) to train an agent that can cooperate with humans in a zero-shot fashion without using any human data. The typical workflow is to first repeatedly run self-play (SP) to build a policy pool and then train the final adaptive policy against this pool. A crucial limitation of this framework is that every policy in the pool is optimized with respect to the environment reward function, which implicitly assumes that the testing partners of the adaptive policy will also be precisely optimizing that same reward function. However, human objectives are often substantially biased by their own preferences, which can differ greatly from the environment reward. We propose a more general framework, Hidden-Utility Self-Play (HSP), which explicitly models human biases as hidden reward functions in the self-play objective. By approximating the reward space with linear functions, HSP adopts an effective technique to generate an augmented policy pool with biased policies. We evaluate HSP on the Overcooked benchmark. Empirical results show that our HSP method produces higher rewards than baselines when cooperating with learned human models, manually scripted policies, and real humans. The HSP policy is also rated as the most assistive policy based on human feedback.
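To make the "hidden reward as a linear function" idea concrete, the sketch below illustrates one possible way a biased partner objective could be parameterized: a weight vector over event-based features defines a biased reward, and sampling different weight vectors yields differently biased self-play objectives whose resulting policies would populate the augmented pool. The feature names, weight ranges, and helper functions here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical event-based features for an Overcooked-like environment
# (e.g., counts of onion pickups, soup deliveries, counter placements, movement).
FEATURE_NAMES = ["pickup_onion", "deliver_soup", "place_on_counter", "movement"]


def hidden_utility_reward(event_features: np.ndarray, w: np.ndarray) -> float:
    """Biased reward as a linear function of event features: r_w = w . phi."""
    return float(w @ event_features)


# Sampling different weight vectors w yields differently "biased" partners;
# each w defines one self-play objective, and the resulting policy would be
# added to the augmented policy pool.
rng = np.random.default_rng(0)
policy_pool_weights = [rng.uniform(-1.0, 1.0, size=len(FEATURE_NAMES)) for _ in range(8)]

# Example: feature counts for one timestep and the biased reward they induce
# under the first sampled weight vector.
phi = np.array([1.0, 0.0, 1.0, 0.2])
print(hidden_utility_reward(phi, policy_pool_weights[0]))
```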