While various multi-agent reinforcement learning methods have been proposed for cooperative settings, few works investigate how self-interested learning agents achieve mutual coordination in decentralized general-sum games and how pre-trained policies generalize to non-cooperative opponents during execution. In this paper, we present Generalizable Risk-Sensitive Policy (GRSP). GRSP learns the distribution over the agent's returns and estimates a dynamic risk-seeking bonus to discover risky coordination strategies. Furthermore, to avoid overfitting to training opponents, GRSP learns an auxiliary opponent modeling task to infer opponents' types and dynamically alters its strategy accordingly during execution. Empirically, agents trained via GRSP achieve stable mutual coordination during training and avoid being exploited by non-cooperative opponents during execution. To the best of our knowledge, GRSP is the first method to learn coordination strategies between agents in both the iterated prisoner's dilemma (IPD) and the iterated stag hunt (ISH) without shaping opponents or rewards, and the first to consider generalization to opponents encountered during execution. Furthermore, we show that GRSP can be scaled to high-dimensional settings.
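As a rough illustration of the risk-seeking idea described above, the sketch below shows one way a bonus could be derived from a learned quantile return distribution (e.g., as in quantile-regression value learning). This is not the authors' implementation; the function `risk_seeking_bonus`, the tail fraction `alpha`, and the weight `beta` are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): a risk-seeking bonus computed
# from the optimistic upper tail of a quantile return distribution.
import numpy as np

def risk_seeking_bonus(quantiles: np.ndarray, alpha: float = 0.25, beta: float = 1.0) -> float:
    """Reward optimism about the upper tail of the return distribution.

    quantiles: estimated quantiles of the return distribution (e.g., from QR-DQN).
    alpha:     fraction of the upper tail used for the optimistic estimate (assumed).
    beta:      scaling of the bonus added to the agent's objective (assumed).
    """
    quantiles = np.sort(quantiles)
    k = max(1, int(np.ceil(alpha * len(quantiles))))
    upper_tail_mean = quantiles[-k:].mean()   # optimistic (risk-seeking) value
    mean_return = quantiles.mean()            # risk-neutral value
    return beta * (upper_tail_mean - mean_return)

# Example: a bimodal return distribution where coordination pays off rarely but
# highly; the positive bonus encourages exploring that risky strategy.
q = np.array([-1.0, -1.0, -1.0, -1.0, 3.0, 3.0, 3.0, 3.0])
print(risk_seeking_bonus(q, alpha=0.25))
```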