In multiagent systems, the complex interaction of fixed incentives can lead agents to outcomes that are poor (inefficient) not only for the group, but also for each individual agent. The price of anarchy is a technical, game-theoretic definition that quantifies the inefficiency arising in these scenarios -- it compares the welfare achievable through perfect coordination against the welfare achieved by self-interested agents at a Nash equilibrium. We derive a differentiable upper bound on the price of anarchy that agents can cheaply estimate during learning. Equipped with this estimator, agents can adjust their incentives in a way that improves the welfare achieved at a Nash equilibrium. Agents do so by learning to mix their reward (equivalently, negative loss) with the rewards of other agents by following the gradient of our derived upper bound. We refer to this approach as D3C. In the case where agent incentives are differentiable, D3C resembles the celebrated Win-Stay, Lose-Shift strategy from behavioral game theory, thereby connecting the global goal of maximum welfare to an established agent-centric learning rule. In the non-differentiable setting, as is common in multiagent reinforcement learning, we show that the upper bound can still be reduced via evolutionary strategies, until a compromise is reached in a distributed fashion. We demonstrate that D3C improves outcomes for each agent and for the group as a whole on social dilemmas including a traffic network exhibiting Braess's paradox, a prisoner's dilemma, and several multiagent domains.
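For concreteness, the quantity referred to above can be written as follows. This is the standard textbook definition rather than anything specific to this work; the welfare function W, the joint action set \mathcal{A}, and the set of Nash equilibria \mathrm{NE} are notational assumptions:

    \mathrm{PoA} \;=\; \frac{\max_{a \in \mathcal{A}} W(a)}{\min_{a \in \mathrm{NE}} W(a)} \;\geq\; 1

A price of anarchy of 1 means the worst Nash equilibrium already attains the welfare of perfect coordination; larger values indicate greater inefficiency.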
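The sketch below illustrates the two computational ingredients described above: agents transform their rewards through a mixing matrix, and, when gradients of the upper bound are unavailable, estimate them with evolutionary strategies. This is a minimal illustrative sketch, not the paper's reference implementation; the names mix_rewards and es_gradient, the row-stochastic parameterization, and all hyperparameter values are assumptions.

    import numpy as np

    def mix_rewards(rewards, W):
        # Each agent's transformed reward is a convex combination of all
        # agents' raw rewards; row i of W holds agent i's mixing weights.
        # W is assumed row-stochastic (nonnegative rows summing to 1), so
        # W = I recovers purely self-interested agents.
        return W @ rewards

    def es_gradient(objective, theta, sigma=0.1, n_samples=64, seed=0):
        # Antithetic evolutionary-strategies estimate of the gradient of
        # objective at theta: perturb theta with Gaussian noise and weight
        # each perturbation by the induced change in the (possibly
        # non-differentiable) objective.
        rng = np.random.default_rng(seed)
        grad = np.zeros_like(theta)
        for _ in range(n_samples):
            eps = rng.standard_normal(theta.shape)
            delta = objective(theta + sigma * eps) - objective(theta - sigma * eps)
            grad += delta / (2.0 * sigma) * eps
        return grad / n_samples

In this picture, each agent would descend the estimated gradient of the upper bound on the price of anarchy with respect to its own row of W, projecting back onto the simplex after each step; self-interest corresponds to the identity matrix, and moving mass off the diagonal corresponds to the compromise described above.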