In multi-agent reinforcement learning (MARL), independent learners are agents that do not observe the actions of the other agents in the system. Because information is decentralized, it is challenging to design independent learners that drive play to equilibrium. This paper investigates the feasibility of using satisficing dynamics to guide independent learners to approximate equilibrium in stochastic games. For $\epsilon \geq 0$, an $\epsilon$-satisficing policy update rule is any rule that instructs the agent not to change its policy when it is $\epsilon$-best-responding to the policies of the remaining players; $\epsilon$-satisficing paths are the sequences of joint policies obtained when each agent uses some $\epsilon$-satisficing policy update rule to select its next policy. We establish structural results on the existence of $\epsilon$-satisficing paths into $\epsilon$-equilibrium in both symmetric $N$-player games and general two-player stochastic games. We then present an independent learning algorithm for $N$-player symmetric games and give high-probability guarantees of convergence to $\epsilon$-equilibrium under self-play. This guarantee is obtained using symmetry alone, and it leverages the previously unexploited structure of $\epsilon$-satisficing paths.
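To make the satisficing condition concrete, here is a minimal formalization under assumed notation (the symbols $J_i$, $\Pi_i^{\epsilon\text{-br}}$, and $F_i$ are illustrative, not taken verbatim from the paper). Writing $J_i(\pi_i, \pi_{-i})$ for player $i$'s expected return under the joint policy $(\pi_i, \pi_{-i})$, the set of $\epsilon$-best-responses to $\pi_{-i}$ is
\[
\Pi_i^{\epsilon\text{-br}}(\pi_{-i}) := \left\{ \pi_i \,:\, J_i(\pi_i, \pi_{-i}) \geq \sup_{\tilde{\pi}_i} J_i(\tilde{\pi}_i, \pi_{-i}) - \epsilon \right\},
\]
and a policy update rule $F_i$ is $\epsilon$-satisficing if $F_i(\pi) = \pi_i$ whenever $\pi_i \in \Pi_i^{\epsilon\text{-br}}(\pi_{-i})$. Since a joint policy $\pi$ is an $\epsilon$-equilibrium exactly when $\pi_i \in \Pi_i^{\epsilon\text{-br}}(\pi_{-i})$ for every player $i$, every $\epsilon$-equilibrium is a fixed point of $\epsilon$-satisficing dynamics.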