Individual rationality, the maximization of expected individual returns, does not always lead to high-utility individual or group outcomes in multi-agent problems. For instance, in multi-agent social dilemmas, Reinforcement Learning (RL) agents trained to maximize individual rewards converge to a low-utility, mutually harmful equilibrium. In contrast, humans evolve useful strategies in such social dilemmas. Inspired by ideas from human psychology that attribute this behavior to the status-quo bias, we present a status-quo loss (SQLoss) and a corresponding policy gradient algorithm that incorporates this bias into an RL agent. We demonstrate that agents trained with SQLoss learn high-utility policies in several social dilemma matrix games (Prisoner's Dilemma, a matrix variant of Stag Hunt, and the Chicken Game). We then show that SQLoss, combined with pre-trained cooperation and defection oracles, outperforms existing state-of-the-art methods at obtaining high-utility policies in non-matrix games with visual input (Coin Game and a visual-input variant of Stag Hunt). Finally, we show that SQLoss extends to a 4-agent setting by demonstrating the emergence of cooperative behavior in the well-known Braess' paradox.
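To make the dilemma concrete, here is a minimal sketch of why individually rational agents converge to the low-utility equilibrium in the Prisoner's Dilemma. The specific payoff values and the `best_response` helper are illustrative assumptions for this sketch, not details taken from the paper.

```python
# Payoffs (row player, column player): C = cooperate, D = defect.
# These values are an assumed, standard Prisoner's Dilemma parameterization.
PAYOFFS = {
    ("C", "C"): (-1, -1),  # mutual cooperation: small mutual cost
    ("C", "D"): (-3,  0),  # the cooperator is exploited
    ("D", "C"): ( 0, -3),
    ("D", "D"): (-2, -2),  # mutual defection: the low-utility equilibrium
}

def best_response(opponent_action: str) -> str:
    """Return the action maximizing the row player's own payoff."""
    return max(["C", "D"], key=lambda a: PAYOFFS[(a, opponent_action)][0])

# Defection is a dominant strategy: it is the best response to both C and D.
# Individually rational agents therefore settle at (D, D) with payoff -2 each,
# even though (C, C) would give both players -1.
assert best_response("C") == "D" and best_response("D") == "D"
print("Best response to C:", best_response("C"))
print("Best response to D:", best_response("D"))
print("Mutual defection payoff:", PAYOFFS[("D", "D")])
```

This is the equilibrium that reward-maximizing RL agents converge to; the status-quo bias encoded by SQLoss is what the paper proposes to escape it.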