We present new efficient-Q learning dynamics for stochastic games, going beyond the recent concentration of progress on provable convergence to possibly inefficient equilibria. Agents follow log-linear learning dynamics in stage games whose payoffs are the Q-functions, while estimating the Q-functions iteratively with a vanishing stepsize. This (implicitly) two-timescale dynamic keeps the stage games relatively stationary for the log-linear update, so that the agents can track the efficient equilibrium of the stage games. We show that, in identical-interest stochastic games, the Q-function estimates converge almost surely to the Q-function associated with the efficient equilibrium, up to an approximation error induced by the softmax response in the log-linear update. The key idea is to approximate the dynamics by a fictional scenario in which the Q-function estimates stay stationary over finite-length epochs. We then couple the dynamics in the main and fictional scenarios to show that the coupling error decays to zero due to the vanishing stepsize.
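As a rough illustration of the dynamics described above, the following Python sketch runs log-linear (softmax) action revisions on the stage games induced by the current Q-estimate, together with a vanishing-stepsize Q-update toward the efficient (maximal) stage-game value, on a randomly generated two-agent identical-interest stochastic game. The temperature TAU, discount GAMMA, stepsize schedule, revision rule, and the random game itself are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

# Minimal sketch of the two-timescale idea: fast log-linear action revisions
# against the stage games whose payoffs are the current Q-estimates, and slow
# Q-updates with a vanishing stepsize.  All constants below are assumptions.

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 3, 2          # small game, for illustration only
GAMMA, TAU = 0.8, 0.05              # assumed discount factor and temperature
T = 20_000

# Common (identical-interest) rewards and transition kernel P[s, a1, a2, s'].
R = rng.uniform(0, 1, size=(N_STATES, N_ACTIONS, N_ACTIONS))
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS, N_ACTIONS))

Q = np.zeros((N_STATES, N_ACTIONS, N_ACTIONS))   # shared Q-function estimate

def log_linear(payoffs):
    """Softmax (log-linear) response to a vector of stage-game payoffs."""
    z = (payoffs - payoffs.max()) / TAU          # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

s = 0
a1, a2 = rng.integers(N_ACTIONS), rng.integers(N_ACTIONS)
for t in range(1, T + 1):
    # Fast timescale (log-linear learning): one randomly chosen agent revises
    # its action by a softmax response to the stage game given by Q[s]; the
    # other agent repeats its previous action.
    if rng.random() < 0.5:
        a1 = rng.choice(N_ACTIONS, p=log_linear(Q[s, :, a2]))
    else:
        a2 = rng.choice(N_ACTIONS, p=log_linear(Q[s, a1, :]))

    r = R[s, a1, a2]
    s_next = rng.choice(N_STATES, p=P[s, a1, a2])

    # Slow timescale: vanishing-stepsize update of the Q-estimate, using the
    # efficient (maximal) value of the next stage game as the continuation.
    alpha = 1.0 / (1 + t) ** 0.7
    Q[s, a1, a2] += alpha * (r + GAMMA * Q[s_next].max() - Q[s, a1, a2])
    s = s_next
```

Because the stepsize vanishes, the stage games seen by the fast action revisions are nearly frozen over long stretches of play, which is the quasi-stationarity that lets the log-linear update track the efficient stage-game equilibrium.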