In general-sum games, the interaction of self-interested learning agents commonly leads to socially inferior outcomes, such as defect-defect in the iterated stag hunt (ISH). Previous works address this challenge by sharing rewards or shaping opponents' learning processes, which rely on overly strong assumptions. In this paper, we show that agents trained to optimize expected returns are more likely to choose a safe action that yields guaranteed but lower rewards. However, there typically exists a risky action that yields higher rewards in the long run only if the agents cooperate, e.g., cooperate-cooperate in ISH. To overcome this, we propose using the action-value distribution to characterize a decision's risk and its potential payoff. Specifically, we present the Adaptable Risk-Sensitive Policy (ARSP). ARSP learns a distribution over the agent's returns and estimates a dynamic risk-seeking bonus to discover risky coordination strategies. Furthermore, to avoid overfitting to its training opponents, ARSP learns an auxiliary opponent-modeling task to infer opponents' types and dynamically alters its strategy accordingly during execution. Empirically, agents trained via ARSP achieve stable coordination during training without access to their opponents' rewards or learning processes, and can adapt to non-cooperative opponents during execution. To the best of our knowledge, ARSP is the first method that learns coordination strategies in both the iterated prisoner's dilemma (IPD) and the iterated stag hunt (ISH) without shaping opponents or rewards, and that can adapt to opponents with distinct strategies during execution. Furthermore, we show that ARSP scales to high-dimensional settings.
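To make the idea of a risk-seeking bonus derived from a return distribution concrete, the following is a minimal sketch, assuming a quantile-based distributional critic (e.g., QR-DQN-style quantile estimates); the function name and the beta and tail parameters are illustrative assumptions, not the paper's actual algorithm. The bonus rewards actions whose upper-tail (best-case) return exceeds their mean return, which favors risky cooperative actions over safe ones.

import numpy as np

def risk_seeking_action(quantiles, beta=0.5, tail=0.25):
    """Pick an action from per-action return quantiles with a risk-seeking bonus.

    quantiles: array of shape (n_actions, n_quantiles); each row holds quantile
               estimates of that action's return distribution, as produced by a
               distributional critic.
    beta:      weight of the risk-seeking bonus (hypothetical parameter).
    tail:      fraction of the upper tail used to measure the optimistic payoff.
    """
    quantiles = np.asarray(quantiles, dtype=float)
    mean_q = quantiles.mean(axis=1)                   # expected return per action
    k = max(1, int(tail * quantiles.shape[1]))        # number of upper-tail quantiles
    upper_tail = np.sort(quantiles, axis=1)[:, -k:].mean(axis=1)
    bonus = upper_tail - mean_q                       # optimism about the best-case payoff
    scores = mean_q + beta * bonus                    # risk-seeking action value
    return int(np.argmax(scores))

# Example: action 0 is "safe" (narrow distribution), action 1 is "risky" but
# pays off only under mutual cooperation (wide, upper-heavy distribution).
safe  = np.full(8, 1.0)
risky = np.array([-1.0, -1.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0])
print(risk_seeking_action(np.stack([safe, risky])))   # -> 1 (the risky, cooperative action)

Both actions have the same expected return (1.0) in this toy example, so a purely expectation-maximizing agent is indifferent or defaults to the safe choice; the upper-tail bonus is what tilts the decision toward the risky coordination strategy.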