Dominated actions are the natural (and perhaps simplest possible) multi-agent generalization of sub-optimal actions in standard single-agent decision making. Thus, similar to standard bandit learning, a basic learning question in multi-agent systems is whether agents can learn to efficiently eliminate all dominated actions in an unknown game when they only observe noisy bandit feedback about the payoffs of their played actions. Surprisingly, despite the seemingly simple task, we show a quite negative result: standard no-regret algorithms -- including the entire family of Dual Averaging algorithms -- provably take exponentially many rounds to eliminate all dominated actions. Moreover, algorithms with the stronger no-swap-regret guarantee suffer from a similar exponential inefficiency. To overcome these barriers, we develop a new algorithm that adjusts Exp3 with Diminishing Historical rewards (termed Exp3-DH); Exp3-DH gradually forgets history at carefully tailored rates. We prove that when all agents run Exp3-DH (i.e., self-play in multi-agent learning), all dominated actions can be iteratively eliminated within polynomially many rounds. Our experimental results further demonstrate the efficiency of Exp3-DH, and show that state-of-the-art bandit algorithms, even those developed specifically for learning in games, fail to eliminate all dominated actions efficiently.
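To make the "diminishing historical rewards" idea concrete, below is a minimal sketch of an Exp3 variant that discounts its accumulated reward estimates each round, so that old history is gradually forgotten. The specific schedule (the learning rate eta, exploration rate gamma, and discount factor rho) and the callback reward_fn are illustrative assumptions for this sketch, not the carefully tailored rates analyzed in the paper.

```python
import numpy as np

def exp3_dh_sketch(num_actions, num_rounds, reward_fn,
                   eta=0.1, gamma=0.05, rho=0.99, seed=0):
    """Illustrative sketch of Exp3 with diminishing historical rewards.

    Differs from vanilla Exp3 only in the last update line: cumulative
    importance-weighted reward estimates are multiplied by rho < 1 each
    round, so the algorithm gradually forgets history.  The parameter
    values here are placeholders, not the paper's tailored rates.
    """
    S = np.zeros(num_actions)            # discounted cumulative reward estimates
    rng = np.random.default_rng(seed)
    p = np.full(num_actions, 1.0 / num_actions)
    for t in range(num_rounds):
        # exponential-weights distribution mixed with uniform exploration
        w = np.exp(eta * (S - S.max()))  # subtract max for numerical stability
        p = (1 - gamma) * w / w.sum() + gamma / num_actions
        a = rng.choice(num_actions, p=p)
        r = reward_fn(a)                 # bandit feedback: payoff of the played action only
        # importance-weighted estimate for the played action
        x_hat = np.zeros(num_actions)
        x_hat[a] = r / p[a]
        # diminish history, then add the new estimate
        S = rho * S + x_hat
    return p

# Toy usage: a single agent facing two actions where action 0 dominates action 1
# (Bernoulli payoffs with means 0.7 vs. 0.3); the returned distribution should
# concentrate on action 0.
final_play = exp3_dh_sketch(
    num_actions=2, num_rounds=5000,
    reward_fn=lambda a: np.random.binomial(1, [0.7, 0.3][a]))
```

In the multi-agent setting studied in the paper, each agent would run such an update in self-play against the others, with reward_fn replaced by the (noisy) game payoff of the joint action; the discounting is what allows agents to iteratively re-adapt once opponents' dominated actions disappear from play.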