We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem, where the player knows only the number of arms $n$ and not the scale or magnitude of the losses. The player observes bandit feedback about the loss vectors $l_1,\dots, l_T \in \mathbb{R}^n$. The goal is to bound the regret as a function of $n$ and $l_1,\dots, l_T$. We design a Follow The Regularized Leader (FTRL) algorithm, which comes with the first scale-free regret guarantee for MAB. It uses the log-barrier regularizer, the importance-weighted estimator, an adaptive learning rate, and an adaptive exploration parameter. In the analysis, we introduce a simple, unifying technique for obtaining regret inequalities for FTRL and Online Mirror Descent (OMD) on the probability simplex using Potential Functions and Mixed Bregmans. We also develop a new technique for obtaining local-norm lower bounds for Bregman divergences, which are crucial in bandit regret bounds. These tools could be of independent interest.
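For concreteness, the two core ingredients named above take their standard forms: the FTRL update over the probability simplex with the log-barrier regularizer, and the importance-weighted loss estimator built from bandit feedback. The display below is an illustrative sketch using notation not fixed in the abstract ($x_t$ for the sampling distribution, $A_t$ for the arm played at round $t$, $\Delta_n$ for the simplex, $\eta_t$ for the learning rate); the adaptive schedules for the learning rate and the exploration parameter are as specified in the paper.
\[
x_t \;=\; \operatorname*{arg\,min}_{x \in \Delta_n} \;\Big\langle x, \sum_{s<t} \hat{l}_s \Big\rangle \;+\; \frac{1}{\eta_t} \sum_{i=1}^{n} \big(-\log x_i\big),
\qquad
\hat{l}_{t,i} \;=\; \frac{l_{t,i}}{x_{t,i}}\,\mathbb{1}\{A_t = i\}.
\]
The importance-weighted estimator is unbiased when $A_t$ is drawn from $x_t$, i.e.\ $\mathbb{E}_{A_t \sim x_t}[\hat{l}_{t,i}] = l_{t,i}$, which lets regret bounds proved against the estimated losses transfer to the true losses.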