This paper considers offline multi-agent reinforcement learning. We propose the strategy-wise concentration principle, which directly builds a confidence interval for the joint strategy, in contrast to the point-wise concentration principle that builds a confidence interval for each point in the joint action space. For two-player zero-sum Markov games, by exploiting the convexity of the strategy-wise bonus, we propose a computationally efficient algorithm whose sample complexity enjoys a better dependency on the number of actions than prior methods based on the point-wise bonus. Furthermore, for offline multi-agent general-sum Markov games, based on the strategy-wise bonus and a novel surrogate function, we give the first algorithm whose sample complexity only scales with $\sum_{i=1}^m A_i$, where $A_i$ is the action set size of the $i$-th player and $m$ is the number of players. In sharp contrast, the sample complexity of methods based on the point-wise bonus would scale with the size of the joint action space $\prod_{i=1}^m A_i$ due to the curse of multiagents. Lastly, all of our algorithms can naturally take a pre-specified strategy class $\Pi$ as input and output a strategy that is close to the best strategy in $\Pi$. In this setting, the sample complexity only scales with $\log |\Pi|$ instead of $\sum_{i=1}^m A_i$.
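To make the contrast concrete, the display below is a schematic sketch (our own illustration, not the paper's exact bonus definitions) of the two concentration principles: a point-wise bonus certifies an uncertainty width for every joint action, so a union bound pays for the joint action space, whereas a strategy-wise bonus certifies a single width for the value of a joint strategy $\pi$, so the union bound only pays for the strategy class.

% Schematic comparison; illustrative only, not the paper's exact bonuses.
% Point-wise concentration: one confidence interval per joint action,
% so the union bound scales with |A_1 x ... x A_m| = \prod_i A_i.
\[
  \bigl|\widehat{Q}(s,\mathbf{a}) - Q(s,\mathbf{a})\bigr|
  \;\lesssim\; b(s,\mathbf{a})
  \qquad \text{for all } \mathbf{a} \in \mathcal{A}_1 \times \cdots \times \mathcal{A}_m .
\]
% Strategy-wise concentration: a single confidence interval for the
% expected value under a joint strategy \pi, so the union bound only
% scales with the strategy class (roughly \sum_i A_i, or \log|\Pi| for a
% pre-specified class \Pi).
\[
  \bigl|\,\mathbb{E}_{\mathbf{a}\sim\pi(\cdot\mid s)}\widehat{Q}(s,\mathbf{a})
        - \mathbb{E}_{\mathbf{a}\sim\pi(\cdot\mid s)} Q(s,\mathbf{a})\,\bigr|
  \;\lesssim\; b_{\pi}(s) .
\]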