We explore the use of policy approximations to reduce the computational cost of learning Nash equilibria in zero-sum stochastic games. We propose a new Q-learning-type algorithm that uses a sequence of entropy-regularized soft policies to approximate the Nash policy during the Q-function updates. We prove that, under certain conditions, updating the regularized Q-function leads the algorithm to converge to a Nash equilibrium. We also demonstrate the proposed algorithm's ability to transfer previous training experience, enabling the agents to adapt quickly to new environments. We provide a dynamic hyper-parameter scheduling scheme to further expedite convergence. Empirical results on a number of stochastic games verify that the proposed algorithm converges to the Nash equilibrium while exhibiting a significant speed-up over existing algorithms.
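As a purely illustrative aid (the precise update rule is given in the paper itself), the sketch below shows what one tabular, entropy-regularized Q-update for a two-player zero-sum stochastic game could look like. All names here (soft_stage_value, soft_q_update, tau, alpha, gamma) are hypothetical, and the damped soft-best-response loop is an assumed stand-in for however the paper actually computes the regularized stage value, not the authors' method.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def entropy(p):
    """Shannon entropy of a probability vector."""
    return -np.sum(p * np.log(p + 1e-12))


def soft_stage_value(Q_s, tau=0.1, n_iters=200):
    """Approximate the entropy-regularized value of the zero-sum matrix game Q_s.

    Q_s is the (n_a, n_b) payoff matrix (rewards to the maximizing player) induced
    by the current Q-function at one state. The damped soft-best-response loop is
    only one illustrative way to approach the regularized saddle point.
    """
    n_a, n_b = Q_s.shape
    pi = np.full(n_a, 1.0 / n_a)   # maximizer's soft policy
    mu = np.full(n_b, 1.0 / n_b)   # minimizer's soft policy
    for _ in range(n_iters):
        pi_br = softmax(Q_s @ mu / tau)        # soft best response of the maximizer
        mu_br = softmax(-(Q_s.T @ pi) / tau)   # soft best response of the minimizer
        pi = 0.5 * pi + 0.5 * pi_br            # damped updates stabilize the iteration
        mu = 0.5 * mu + 0.5 * mu_br
    value = pi @ Q_s @ mu + tau * entropy(pi) - tau * entropy(mu)
    return value, pi, mu


def soft_q_update(Q, s, a, b, r, s_next, alpha=0.1, gamma=0.95, tau=0.1):
    """One tabular update of the regularized Q-function after observing (s, a, b, r, s').

    Q maps each state to an (n_a, n_b) array of action-pair values.
    """
    v_next, _, _ = soft_stage_value(Q[s_next], tau=tau)
    Q[s][a, b] += alpha * (r + gamma * v_next - Q[s][a, b])
```

In this reading, the temperature tau controls how closely the soft policies approximate the Nash policy, which is where a dynamic hyper-parameter schedule of the kind mentioned above would plausibly act.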