Recently, a multi-agent variant of the classical multi-armed bandit problem was proposed to tackle fairness issues in online learning. Inspired by a long line of work in social choice and economics, the goal is to optimize the Nash social welfare instead of the total utility. Unfortunately, previous algorithms are either inefficient or achieve sub-optimal regret in terms of the number of rounds $T$. We propose a new efficient algorithm whose regret is lower even than that of previous inefficient ones. For $N$ agents, $K$ arms, and $T$ rounds, our approach has a regret bound of $\tilde{O}(\sqrt{NKT} + NK)$, improving on the previous approach, whose regret bound is $\tilde{O}(\min(NK, \sqrt{N} K^{3/2})\sqrt{T})$. We also complement our efficient algorithm with an inefficient approach achieving $\tilde{O}(\sqrt{KT} + N^2K)$ regret. Experimental results confirm the effectiveness of our efficient algorithm compared to previous approaches.
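The abstract does not define the objective, so the following is a minimal sketch of the commonly used formulation, not the paper's own implementation: in each round the learner commits to a distribution $p$ over the $K$ arms, agent $i$'s utility is its expected reward under $p$, and the Nash social welfare is the product of the agents' utilities. The function name `nash_social_welfare` and the matrix `mu` below are illustrative notation, assumed here rather than taken from the paper.

```python
import numpy as np

def nash_social_welfare(p, mu):
    """Nash social welfare of an arm distribution p (illustrative sketch).

    p  : (K,) probability vector over arms.
    mu : (N, K) matrix; mu[i, k] is agent i's mean reward for arm k.

    Each agent's utility is its expected reward under p; the NSW is the
    product of these utilities (some formulations take the 1/N power,
    i.e. the geometric mean, which does not change the maximizer).
    """
    utilities = mu @ p  # shape (N,): expected reward of each agent under p
    return float(np.prod(utilities))

# Toy example with N = 2 agents and K = 3 arms whose rewards conflict.
mu = np.array([[0.9, 0.1, 0.5],
               [0.1, 0.9, 0.5]])

uniform = np.ones(3) / 3
greedy = np.array([1.0, 0.0, 0.0])

print(nash_social_welfare(uniform, mu))  # 0.25: both agents get utility 0.5
print(nash_social_welfare(greedy, mu))   # 0.09: agent 1 is nearly starved
```

Both distributions here yield the same total utility of $1.0$, yet the NSW strongly prefers the uniform mix, which is exactly the fairness distinction the abstract draws between optimizing Nash social welfare and optimizing total utility.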