We study the problem of guaranteeing low regret in repeated games against an opponent with unknown membership in one of several classes. We add the constraint that our algorithm is non-exploitable, in that the opponent lacks an incentive to use an algorithm against which we cannot achieve rewards exceeding some "fair" value. Our solution is an expert algorithm (LAFF) that searches within a set of sub-algorithms that are optimal for each opponent class and uses a punishment policy upon detecting evidence of exploitation by the opponent. With benchmarks that depend on the opponent class, we show that LAFF has sublinear regret uniformly over the possible opponents, except exploitative ones, for which we guarantee that the opponent has linear regret. To our knowledge, this work is the first to provide guarantees for both regret and non-exploitability in multi-agent learning.
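To make the control flow described above concrete, here is a minimal, hypothetical sketch of an expert-style controller in the spirit of LAFF. Everything specific in it is an assumption for illustration only: the class name `LAFFSketch`, the elimination rule (switching experts when average reward drops below `fair_value - slack` after a warm-up), and all parameter values are not the paper's actual algorithm, test statistic, or bounds.

```python
class LAFFSketch:
    """Hedged sketch of an expert-style controller, not the paper's LAFF.

    Assumptions (ours, not the abstract's): each sub-algorithm is a callable
    mapping an observation to an action, rewards lie in [0, 1], and
    'evidence of exploitation' is flagged when every remaining expert's
    average reward falls below fair_value - slack after a warm-up phase.
    """

    def __init__(self, experts, punish, fair_value, slack=0.05, warmup=100):
        self.experts = list(experts)    # one sub-algorithm per opponent class
        self.punish = punish            # punishment policy
        self.fair_value = fair_value    # the "fair" reward benchmark
        self.slack = slack              # tolerance before ruling an expert out
        self.warmup = warmup            # rounds before an expert can be ruled out
        self.active = 0                 # index of the expert currently followed
        self.totals = [0.0] * len(self.experts)
        self.counts = [0] * len(self.experts)
        self.punishing = False

    def act(self, observation):
        # Follow the punishment policy once exploitation has been detected.
        policy = self.punish if self.punishing else self.experts[self.active]
        return policy(observation)

    def update(self, reward):
        if self.punishing:
            return
        i = self.active
        self.totals[i] += reward
        self.counts[i] += 1
        mean = self.totals[i] / self.counts[i]
        # Rule out an under-performing expert after the warm-up phase.
        if self.counts[i] >= self.warmup and mean < self.fair_value - self.slack:
            if self.active + 1 < len(self.experts):
                self.active += 1        # try the next opponent-class expert
            else:
                self.punishing = True   # all experts ruled out: punish


# Toy usage with two hypothetical opponent-class experts and a punish policy.
agent = LAFFSketch(
    experts=[lambda obs: "cooperate", lambda obs: "tit-for-tat"],
    punish=lambda obs: "defect",
    fair_value=0.5,
)
action = agent.act(observation=None)
agent.update(reward=0.3)
```

The design choice this sketch illustrates is the abstract's two-level structure: an outer loop that searches over per-class sub-algorithms, and a fallback punishment policy that guarantees an exploitative opponent cannot profit from forcing the learner below the fair value.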