We study online learning problems in which the learner has extra knowledge about the adversary's behaviour, i.e., game-theoretic settings in which the opponent follows a no-external-regret learning algorithm. Under this assumption, we propose two new online learning algorithms, Accurate Follow the Regularized Leader (AFTRL) and Prod-Best Response (Prod-BR), that intensively exploit this extra knowledge while maintaining the no-regret property in the worst case, i.e., when the extra information is inaccurate. Specifically, AFTRL achieves $O(1)$ external regret or $O(1)$ \emph{forward regret} against a no-external-regret adversary, compared with the $O(\sqrt{T})$ \emph{dynamic regret} of Prod-BR. To the best of our knowledge, ours is the first algorithm to consider forward regret and to achieve $O(1)$ regret against strategic adversaries. When playing zero-sum games with Accurate Multiplicative Weights Update (AMWU), a special case of AFTRL, we achieve \emph{last-round convergence} to the Nash equilibrium. We also provide numerical experiments to further support our theoretical results. In particular, we demonstrate that our methods achieve significantly better regret bounds and rates of last-round convergence than the state of the art (e.g., Multiplicative Weights Update (MWU) and its optimistic counterpart, OMWU).
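To make the baseline concrete, the following is a minimal sketch, not the paper's exact AFTRL/AMWU rule: it contrasts the standard Multiplicative Weights Update (MWU) with a prediction-augmented update that also charges a guess of the adversary's next loss vector, the kind of "extra knowledge" referred to above. The names \texttt{predictive\_mwu} and \texttt{predicted\_loss} are illustrative assumptions, not identifiers from the paper.

\begin{verbatim}
# Minimal sketch: plain MWU vs. a prediction-augmented MWU-style update.
# This is an illustration under assumptions, not the paper's AFTRL/AMWU rule.
import numpy as np

def mwu(cum_loss, eta):
    """MWU / entropic FTRL: weights proportional to exp(-eta * cumulative loss)."""
    w = np.exp(-eta * (cum_loss - cum_loss.min()))  # shift for numerical stability
    return w / w.sum()

def predictive_mwu(cum_loss, predicted_loss, eta):
    """Same multiplicative form, but applied to cumulative loss plus a
    prediction of the upcoming loss; with an accurate prediction the learner
    effectively anticipates the adversary's next move."""
    return mwu(cum_loss + predicted_loss, eta)

# Tiny usage example on a 3-action problem with random losses.
rng = np.random.default_rng(0)
n, T, eta = 3, 500, 0.1
cum = np.zeros(n)
for t in range(T):
    loss = rng.uniform(size=n)
    w_plain = mwu(cum, eta)                  # uses only past information
    w_pred = predictive_mwu(cum, loss, eta)  # here the "prediction" is exact
    cum += loss
\end{verbatim}

When the prediction is exact, the augmented update plays against the true upcoming loss and thus incurs essentially no regret in that round; when the prediction is wrong, the update degrades gracefully toward plain MWU behaviour, which mirrors the worst-case guarantee claimed in the abstract.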