We consider the multi-agent reinforcement learning setting with imperfect information, in which each agent tries to maximize its own utility. The reward function depends on the hidden state (or goal) of both agents, so each agent must infer the other agent's hidden goal from its observed behavior in order to solve the task. We propose a new approach for learning in these domains: Self Other-Modeling (SOM), in which an agent uses its own policy to predict the other agent's actions and to update its belief about the other's hidden state in an online manner. We evaluate this approach on three different tasks and show that the agents learn better policies by using their estimates of the other agent's hidden state, in both cooperative and adversarial settings.
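To make the online belief update concrete, here is a minimal PyTorch sketch of the idea described above. All names, dimensions, and the inner SGD loop are illustrative assumptions rather than the paper's exact implementation: the agent reuses its own policy network with the goal inputs swapped, and takes a few gradient steps on its estimate of the other agent's goal to raise the likelihood of the action the other agent was observed to take.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):
    """Goal-conditioned policy (hypothetical architecture)."""
    def __init__(self, obs_dim, goal_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 2 * goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, own_goal, other_goal):
        x = torch.cat([obs, own_goal, other_goal], dim=-1)
        return self.net(x)  # action logits

def som_belief_update(policy, obs, z_other, own_goal,
                      observed_action, lr=0.1, steps=5):
    """One online inference step (sketch): fit z_other so that the
    agent's OWN policy, conditioned on z_other in the acting-agent
    slot, assigns high probability to the other agent's observed
    action. lr and steps are made-up hyperparameters."""
    z = z_other.clone().detach().requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Roles are swapped: z plays the acting agent's goal,
        # and our own goal is fed in as the "other" goal.
        logits = policy(obs, z, own_goal)
        loss = F.cross_entropy(logits, observed_action)
        loss.backward()
        opt.step()
    return z.detach()

# Toy usage with made-up dimensions.
policy = Policy(obs_dim=8, goal_dim=4, n_actions=5)
obs = torch.randn(1, 8)
own_goal = torch.randn(1, 4)
z_other = torch.zeros(1, 4)           # current belief; refined at every step
observed_action = torch.tensor([2])   # action the other agent actually took
z_other = som_belief_update(policy, obs, z_other, own_goal, observed_action)
```

The key design choice this sketch illustrates is that no separate opponent model is trained: the agent's own policy doubles as the likelihood model for the other agent's behavior, and only the belief vector `z_other` is optimized online.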