While single-agent policy optimization in a fixed environment has attracted considerable research attention in the reinforcement learning community, much less is known theoretically when multiple agents interact in a potentially competitive environment. We take a step forward by proposing and analyzing new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions. We consider two classes of transition structures: factored independent transitions and single-controller transitions. For both settings, we prove tight $\widetilde{\mathcal{O}}(\sqrt{K})$ regret bounds after $K$ episodes in a two-agent competitive game scenario. The regret of each agent is measured against a potentially adversarial opponent who can choose the single best policy in hindsight after observing the full policy sequence. Our algorithms combine Upper Confidence Bound (UCB)-type optimism with fictitious play for simultaneous policy optimization in a non-stationary environment. When both players adopt the proposed algorithms, their overall optimality gap is $\widetilde{\mathcal{O}}(\sqrt{K})$.
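To give a concrete feel for the combination of fictitious play and UCB-type optimism described above, the sketch below runs smoothed fictitious play with optimistic payoff estimates in a toy two-player zero-sum matrix game. This is only an illustrative assumption, not the paper's algorithm for Markov games: the payoff matrix `A`, the horizon `K`, and the exploration-bonus scale are hypothetical choices made for this example.

```python
# Illustrative sketch (hypothetical, not the paper's algorithm): fictitious play
# with UCB-style optimistic payoff estimates in a zero-sum matrix game.
import numpy as np

A = np.array([[0.0, 1.0],
              [1.0, 0.0]])          # row player's payoff; column player receives -A
n_rows, n_cols = A.shape
K = 5000                            # number of rounds (hypothetical horizon)

row_counts = np.ones(n_rows)        # empirical action counts for fictitious play
col_counts = np.ones(n_cols)

payoff_sum = np.zeros((n_rows, n_cols))  # running payoff sums per action pair
visits = np.ones((n_rows, n_cols))       # visit counts for the UCB bonus

for k in range(1, K + 1):
    mean = payoff_sum / visits
    bonus = np.sqrt(2.0 * np.log(k + 1) / visits)   # UCB-type exploration bonus

    # Each player best-responds to the opponent's empirical action mixture,
    # using an estimate that is optimistic from its own point of view.
    col_mix = col_counts / col_counts.sum()
    row_mix = row_counts / row_counts.sum()
    i = int(np.argmax((mean + bonus) @ col_mix))     # row player maximizes A
    j = int(np.argmin(row_mix @ (mean - bonus)))     # column player minimizes A

    # Observe the realized payoff and update counts and estimates.
    payoff_sum[i, j] += A[i, j]
    visits[i, j] += 1
    row_counts[i] += 1
    col_counts[j] += 1

# For this game the empirical mixtures should approach the Nash equilibrium (1/2, 1/2).
print("row mixture:", np.round(row_counts / row_counts.sum(), 3))
print("col mixture:", np.round(col_counts / col_counts.sum(), 3))
```

In this toy setting the optimism enters only through the exploration bonus on the estimated payoffs; the paper's setting additionally has unknown structured transitions, which the abstract indicates are handled by UCB-type confidence sets within the same fictitious-play scheme.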