We study decentralized policy learning in Markov games where we control a single agent to play against nonstationary and possibly adversarial opponents. Our goal is to develop a no-regret online learning algorithm that (i) takes actions based on the local information observed by the agent and (ii) is able to find the best policy in hindsight. For such a problem, the nonstationary state transitions induced by the varying opponent pose a significant challenge. In light of a recent hardness result \citep{liu2022learning}, we focus on the setting where the opponent's previous policies are revealed to the agent for decision making. With such an information structure, we propose a new algorithm, \underline{D}ecentralized \underline{O}ptimistic hype\underline{R}policy m\underline{I}rror de\underline{S}cent (DORIS), which achieves $\sqrt{K}$-regret in the context of general function approximation, where $K$ is the number of episodes. Moreover, when all the agents adopt DORIS, we prove that their mixture policy constitutes an approximate coarse correlated equilibrium. In particular, DORIS maintains a \textit{hyperpolicy}, which is a distribution over the policy space. The hyperpolicy is updated via mirror descent, where the update direction is obtained by an optimistic variant of least-squares policy evaluation. Furthermore, to illustrate the power of our method, we apply DORIS to constrained and vector-valued MDPs, which can be formulated as zero-sum Markov games with a fictitious opponent.
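As a schematic illustration (the notation below is ours and not the formal algorithm description), the hyperpolicy update can be viewed as a mirror-descent step with a KL-divergence mirror map over a policy class $\Pi$: writing $p_k \in \Delta(\Pi)$ for the hyperpolicy at episode $k$, $\eta > 0$ for the learning rate, and $\widehat{V}_k(\pi)$ for an optimistic value estimate of policy $\pi$ obtained from least-squares policy evaluation against the opponent's revealed policies, such a step takes the exponential-weights form
\[
p_{k+1}(\pi) \;\propto\; p_k(\pi)\,\exp\!\bigl(\eta\,\widehat{V}_k(\pi)\bigr), \qquad \pi \in \Pi,
\]
so that policies with larger optimistic value estimates receive exponentially more weight in the next hyperpolicy.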