Existing studies on provably efficient algorithms for Markov games (MGs) almost exclusively build on the "optimism in the face of uncertainty" (OFU) principle. This work focuses on a different approach, posterior sampling, which is celebrated in many bandit and reinforcement learning settings but remains under-explored for MGs. Specifically, for episodic two-player zero-sum MGs, a novel posterior sampling algorithm is developed with general function approximation. Theoretical analysis demonstrates that the posterior sampling algorithm admits a $\sqrt{T}$-regret bound for problems with a low multi-agent decoupling coefficient, a new complexity measure for MGs, where $T$ denotes the number of episodes. When specialized to linear MGs, the obtained regret bound matches the state-of-the-art results. To the best of our knowledge, this is the first provably efficient posterior sampling algorithm for MGs with frequentist regret guarantees, which enriches the toolbox for MGs and promotes the broad applicability of posterior sampling.
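To convey the intuition behind the posterior sampling template, the following is a minimal sketch on a deliberately degenerate case: a one-step two-player zero-sum matrix game with Bernoulli payoffs, solved by self-play. It is not the paper's algorithm (which handles multi-step episodes and general function approximation); the Beta-Bernoulli model and the helper `solve_zero_sum` are illustrative assumptions of ours.

```python
# Illustrative sketch only: posterior sampling self-play on a zero-sum matrix
# game with Bernoulli payoffs.  Each episode, a payoff matrix is sampled from
# the Beta posterior, both players play the minimax strategies of the sampled
# game, and the posterior is updated with the observed payoff.
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Return (x, v): the row (max) player's minimax strategy and value of A."""
    m, n = A.shape
    # Variables: [x_1, ..., x_m, v]; maximize v  <=>  minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every column j: sum_i A[i, j] x_i >= v  <=>  -A^T x + v <= 0.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # sum_i x_i = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * m + [(None, None)]             # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    x = np.clip(res.x[:m], 0.0, None)
    return x / x.sum(), res.x[-1]

rng = np.random.default_rng(0)
m, n, T = 3, 3, 500
true_payoff = rng.uniform(size=(m, n))   # unknown Bernoulli payoff means
alpha = np.ones((m, n))                  # Beta(1, 1) prior for each entry
beta = np.ones((m, n))

for t in range(T):
    sampled = rng.beta(alpha, beta)      # draw a payoff matrix from the posterior
    x, _ = solve_zero_sum(sampled)       # max-player's strategy on the sample
    y, _ = solve_zero_sum(-sampled.T)    # min-player's strategy on her own view
    i = rng.choice(m, p=x)
    j = rng.choice(n, p=y)
    r = rng.binomial(1, true_payoff[i, j])  # observe a Bernoulli payoff
    alpha[i, j] += r                        # conjugate Beta posterior update
    beta[i, j] += 1 - r
```

Under this simplification, the randomness of the posterior draw plays the exploratory role that explicit bonuses play in OFU-style methods; the paper's analysis extends this idea to episodic MGs via the multi-agent decoupling coefficient.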