Cooperation and Reputation Dynamics with Reinforcement Learning

Creating incentives for cooperation is a challenge in natural and artificial systems. One potential answer is reputation, whereby agents trade the immediate cost of cooperation for the future benefits of having a good reputation. Game-theoretic models have shown that specific social norms can make cooperation stable, but how agents can learn to establish effective reputation mechanisms on their own is less understood. We use a simple model of reinforcement learning to show that reputation mechanisms generate two coordination problems: agents need to learn how to coordinate on the meaning of existing reputations, and to collectively agree on a social norm for assigning reputations to others based on their behavior. These coordination problems exhibit multiple equilibria, some of which effectively establish cooperation. When we train agents with a standard Q-learning algorithm in an environment with reputation mechanisms, convergence to undesirable equilibria is widespread. We propose two mechanisms to alleviate this: (i) seeding a proportion of the system with fixed agents that steer others towards good equilibria; and (ii) intrinsic rewards based on the idea of introspection, i.e., augmenting agents' rewards by an amount proportional to the performance of their own strategy against themselves. A combination of these simple mechanisms is successful in stabilizing cooperation, even in a fully decentralized version of the problem where agents learn to use and assign reputations simultaneously. We show how our results relate to the literature in Evolutionary Game Theory, and discuss implications for artificial, human and hybrid systems, where reputations can be used as a way to establish trust and cooperation.
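To make the introspection idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes a donation game with benefit `b` and cost `c`, binary reputations, a "stern judging" norm chosen purely for illustration, and a myopic (discount 0) tabular Q-update applied to donors only. All names (`stern_judging`, `introspective_payoff`, `train`) and all parameter values are hypothetical.

```python
import random

COOPERATE, DEFECT = 1, 0
GOOD, BAD = 1, 0


def stern_judging(action, recipient_rep):
    """Illustrative norm: the donor is GOOD iff it cooperated with a GOOD
    recipient or defected against a BAD one."""
    return GOOD if (action == COOPERATE) == (recipient_rep == GOOD) else BAD


def greedy(q, rep):
    """Greedy action toward a recipient with reputation `rep` (ties: DEFECT)."""
    return COOPERATE if q[rep][COOPERATE] > q[rep][DEFECT] else DEFECT


def introspective_payoff(q, recipient_rep, action, b, c, norm=stern_judging):
    """Payoff of `action` if the partner were a copy of the agent itself.

    Assumption: the copy sees the reputation the norm assigns to the agent
    after `action`, and responds with the agent's own greedy policy.
    """
    new_rep = norm(action, recipient_rep)
    received = b if greedy(q, new_rep) == COOPERATE else 0.0
    paid = c if action == COOPERATE else 0.0
    return received - paid


def train(n_agents=10, steps=5000, b=5.0, c=1.0, lr=0.1, eps=0.1,
          alpha=1.0, seed=0):
    """Myopic tabular Q-learning for donors, with an introspective bonus."""
    rng = random.Random(seed)
    # q[agent][recipient_rep][action], indexed by the constants above
    q = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(n_agents)]
    reps = [rng.choice((GOOD, BAD)) for _ in range(n_agents)]
    for _ in range(steps):
        donor, recipient = rng.sample(range(n_agents), 2)
        s = reps[recipient]
        if rng.random() < eps:  # epsilon-greedy exploration
            a = rng.choice((COOPERATE, DEFECT))
        else:
            a = greedy(q[donor], s)
        extrinsic = -c if a == COOPERATE else 0.0
        reward = extrinsic + alpha * introspective_payoff(q[donor], s, a, b, c)
        q[donor][s][a] += lr * (reward - q[donor][s][a])
        reps[donor] = stern_judging(a, s)  # norm assigns the donor's new reputation
    return q, reps


# Two hand-written Q-tables: a discriminator (cooperate with GOOD, defect
# against BAD) and an unconditional defector.
discriminator = [[1.0, 0.0], [0.0, 1.0]]
always_defect = [[0.0, 0.0], [0.0, 0.0]]
print(introspective_payoff(discriminator, GOOD, COOPERATE, b=5.0, c=1.0))  # 4.0
print(introspective_payoff(always_defect, GOOD, COOPERATE, b=5.0, c=1.0))  # -1.0
```

Under these assumptions the bonus rewards a discriminator for cooperating with a good partner (it keeps a good reputation and its copy reciprocates), while an unconditional defector only sees the cost. In this toy setup, agents initialized with all-zero Q-values never leave unconditional defection, which is one way to see the multiplicity of equilibria the abstract describes.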
