Reinforcement learning has been shown to be an effective strategy for automatically training policies for challenging control problems. Focusing on non-cooperative multi-agent systems, we propose a novel reinforcement learning framework for training joint policies that form a Nash equilibrium. In our approach, rather than providing low-level reward functions, the user provides high-level specifications that encode the objective of each agent. Then, guided by the structure of the specifications, our algorithm searches over policies to identify one that provably forms an $\epsilon$-Nash equilibrium (with high probability). Importantly, it prioritizes policies in a way that maximizes social welfare across all agents. Our empirical evaluation demonstrates that our algorithm computes equilibrium policies with high social welfare, whereas state-of-the-art baselines either fail to compute Nash equilibria or compute ones with comparatively lower social welfare.
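Concretely, writing $J_i(\pi)$ for agent $i$'s expected return under a joint policy $\pi = (\pi_1, \ldots, \pi_n)$ (notation introduced here for illustration, following the standard game-theoretic definitions rather than this paper's), a joint policy is an $\epsilon$-Nash equilibrium when no agent can gain more than $\epsilon$ by deviating unilaterally, and the social welfare being maximized is the total expected return:
$$
\max_{\pi_i'} \, J_i(\pi_i', \pi_{-i}) - J_i(\pi) \;\le\; \epsilon \quad \text{for all } i,
\qquad
\mathrm{welfare}(\pi) \;=\; \sum_{i=1}^{n} J_i(\pi),
$$
where $\pi_{-i}$ denotes the policies of all agents other than $i$.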