We consider online no-regret learning in unknown games with bandit feedback, where at each time step each player observes only its realized reward -- determined by all players' current joint action -- rather than its gradient. We focus on the class of smooth and strongly monotone games and study optimal no-regret learning therein. Leveraging self-concordant barrier functions, we first construct a new bandit learning algorithm and show that it achieves the single-agent optimal regret of $\tilde{\Theta}(n\sqrt{T})$ under smooth and strongly concave reward functions ($n \geq 1$ is the problem dimension). We then show that if each player applies this no-regret learning algorithm in strongly monotone games, the joint action converges in the last iterate to the unique Nash equilibrium at a rate of $\tilde{\Theta}(\sqrt{\frac{n^2}{T}})$. Prior to our work, the best-known last-iterate convergence rate in this class of games was $\tilde{O}(\sqrt[3]{\frac{n^2}{T}})$ (achieved by a different algorithm), leaving open the problem of designing an optimal no-regret learning algorithm, since the known lower bound is $\Omega(\sqrt{\frac{n^2}{T}})$. Our results settle this open problem and contribute to the broad landscape of bandit game-theoretical learning by identifying the first doubly optimal bandit learning algorithm, in that it achieves (up to logarithmic factors) both optimal regret in single-agent learning and the optimal last-iterate convergence rate in multi-agent learning. We also present results on several applications -- Cournot competition, Kelly auctions, and distributed regularized logistic regression -- that demonstrate the efficacy of our algorithm.
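The construction above rests on shaping single-point (bandit) exploration with a self-concordant barrier. The sketch below illustrates that generic mechanism on the unit ball, assuming a smooth, strongly concave reward accessed only through scalar feedback: sample a direction on the sphere, perturb the iterate inside the Dikin ellipsoid of the barrier, and form a one-point gradient estimate from the observed reward. The names (`bandit_sketch`, `eta`, `delta`), the simplified ascent-plus-projection update, and all tuning constants are illustrative placeholders, not the algorithm or parameter schedule proposed in the paper.

```python
import numpy as np

# Minimal illustrative sketch (NOT the paper's exact algorithm) of bandit learning
# with single-point feedback, where exploration is shaped by the Dikin ellipsoid of
# a self-concordant barrier. Domain: the open unit ball in R^n with barrier
# R(x) = -log(1 - ||x||^2). Step size `eta`, exploration radius `delta`, and the
# projection margin are placeholders chosen for readability.

def barrier_hessian(x):
    """Hessian of R(x) = -log(1 - ||x||^2), a self-concordant barrier for the unit ball."""
    s = 1.0 - x @ x
    return (2.0 / s) * np.eye(x.size) + (4.0 / s ** 2) * np.outer(x, x)

def bandit_sketch(reward, n, T, eta=0.01, delta=0.5, seed=None):
    rng = np.random.default_rng(seed)
    x_hat = np.zeros(n)                      # unperturbed iterate, kept in the interior
    for _ in range(T):
        H = barrier_hessian(x_hat)
        L = np.linalg.cholesky(H)            # H = L L^T
        B = np.linalg.inv(L).T               # B B^T = H^{-1}: Dikin-ellipsoid shape matrix
        u = rng.standard_normal(n)
        u /= np.linalg.norm(u)               # uniform direction on the unit sphere
        x_play = x_hat + delta * B @ u       # feasible: the Dikin ellipsoid lies inside the ball
        r = reward(x_play)                   # bandit feedback: a single scalar reward
        g_hat = (n / delta) * r * (L @ u)    # one-point estimate of a smoothed-reward gradient
        x_hat = x_hat + eta * g_hat          # simplified ascent step; the paper's update differs
        radius = np.linalg.norm(x_hat)       # crude projection back to a slightly shrunk ball
        if radius > 0.95:
            x_hat *= 0.95 / radius
    return x_hat

# Example: a smooth, strongly concave reward maximized at x* = (0.3, ..., 0.3).
# x_star = 0.3 * np.ones(5)
# x_final = bandit_sketch(lambda x: -np.sum((x - x_star) ** 2), n=5, T=20000)
```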