One practical requirement in solving dynamic games is to ensure that the players play well from any decision point onward. To satisfy this requirement, existing efforts focus on equilibrium refinement, but the scalability and applicability of existing techniques are limited. In this paper, we propose Temporal-Induced Self-Play (TISP), a novel reinforcement-learning-based framework for finding strategies that perform well from any decision point onward. TISP combines belief-space representation, backward induction, policy learning, and non-parametric approximation. Building upon TISP, we design TISP-PG, a policy-gradient-based algorithm. We prove that TISP-based algorithms can find approximate Perfect Bayesian Equilibria in zero-sum one-sided stochastic Bayesian games with finite horizon. We evaluate TISP-based algorithms in various games, including finitely repeated security games and a grid-world game. The results show that TISP-PG is more scalable than existing mathematical-programming-based methods and significantly outperforms other learning-based methods.
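To make the abstract's ingredients concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of the backward-induction-over-belief-space idea: a separate belief-conditioned strategy is obtained for each time step, from the last step backward, with the next step's value function approximated on a discrete belief grid (a crude stand-in for the paper's non-parametric approximation) and each belief point solved by simple gradient-descent-ascent (a crude stand-in for TISP-PG's policy-gradient learning). The toy game, payoff matrices, and all function names below are assumptions for illustration only.

```python
# Hypothetical sketch of backward induction over belief space for a toy
# zero-sum one-sided Bayesian game: player 1 knows its type (2 types),
# player 2 only holds a belief p0 = Pr(type = 0). Not the paper's code.
import jax
import jax.numpy as jnp

T = 3                                   # finite horizon
A = jnp.array([[[2., 0.], [0., 1.]],    # stage payoffs to player 1 if its type is 0
               [[0., 1.], [2., 0.]]])   # stage payoffs to player 1 if its type is 1
GRID = jnp.linspace(0.0, 1.0, 21)       # belief grid over Pr(type = 0)
EPS = 1e-8

def payoff(x_logits, y_logits, p0, v_next):
    """Expected total payoff to player 1 at belief p0, given the next-step
    value function v_next tabulated on GRID (linear interpolation in between)."""
    x = jax.nn.softmax(x_logits, axis=-1)   # per-type mixed strategy, shape (2, 2)
    y = jax.nn.softmax(y_logits)            # uninformed player's mixed strategy
    p = jnp.array([p0, 1.0 - p0])
    stage = jnp.sum(p[:, None, None] * A * x[:, :, None] * y[None, None, :])
    pa = p @ x                              # marginal probability of each player-1 action
    post0 = (p[0] * x[0]) / (pa + EPS)      # Bayesian posterior of type 0 after each action
    cont = jnp.sum(pa * jnp.interp(post0, GRID, v_next))
    return stage + cont

grad_x = jax.grad(payoff, argnums=0)
grad_y = jax.grad(payoff, argnums=1)

@jax.jit
def gda_step(x_logits, y_logits, p0, v_next, lr=0.05):
    # One gradient-descent-ascent step; a practical implementation would use
    # averaging or a more careful learner, since plain GDA may cycle.
    gx = grad_x(x_logits, y_logits, p0, v_next)
    gy = grad_y(x_logits, y_logits, p0, v_next)
    return x_logits + lr * gx, y_logits - lr * gy

def solve_point(p0, v_next, steps=1000):
    """Approximate the equilibrium of the one-step game at a single belief point."""
    x_logits, y_logits = jnp.zeros((2, 2)), jnp.zeros(2)
    for _ in range(steps):
        x_logits, y_logits = gda_step(x_logits, y_logits, p0, v_next)
    return payoff(x_logits, y_logits, p0, v_next), (x_logits, y_logits)

# Backward induction: the value table for step t is built from the one for step t + 1,
# and the per-step, belief-conditioned strategies are stored from the last step backward.
values = jnp.zeros_like(GRID)               # terminal values
policies = []
for t in reversed(range(T)):
    out = [solve_point(p0, values) for p0 in GRID]
    values = jnp.array([v for v, _ in out])
    policies.append([pi for _, pi in out])
print("approximate game value at uniform prior:", float(jnp.interp(0.5, GRID, values)))
```

The grid lookup in `payoff` mirrors the role of the value approximation over belief space: strategies at step t only need the (approximate) values of step t + 1 at the posterior beliefs induced by each observable action.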