We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime or the time horizon. The algorithm is based on online mirror descent (OMD) with Tsallis entropy regularization of power $\alpha=1/2$ and reduced-variance loss estimators. More generally, we define an adversarial regime with a self-bounding constraint, which includes the stochastic regime, the stochastically constrained adversarial regime (Wei and Luo), and the stochastic regime with adversarial corruptions (Lykouris et al.) as special cases, and show that the algorithm achieves a logarithmic regret guarantee in this regime and all of its special cases simultaneously with the adversarial regret guarantee. The algorithm also achieves adversarial and stochastic optimality in the utility-based dueling bandit setting. We provide an empirical evaluation of the algorithm demonstrating that it significantly outperforms UCB1 and EXP3 in stochastic environments. We also provide examples of adversarial environments where UCB1 and Thompson Sampling exhibit almost linear regret, whereas our algorithm suffers only logarithmic regret. To the best of our knowledge, this is the first example demonstrating the vulnerability of Thompson Sampling in adversarial environments. Last but not least, we present a general stochastic analysis and a general adversarial analysis of OMD algorithms with Tsallis entropy regularization for $\alpha\in[0,1]$ and explain why $\alpha=1/2$ works best.
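For concreteness, the following is a minimal sketch of the update the abstract refers to, written with one common parametrization of the Tsallis entropy regularizer over the probability simplex $\Delta_{K-1}$; the exact constants, the form of the learning rate schedule, and the baseline $B_t$ are assumptions for illustration rather than the precise choices made later in the paper.
\begin{align*}
  \Psi_t(w) &= -\frac{1}{\eta_t} \sum_{i=1}^{K} \frac{w_i^{\alpha}}{\alpha},
  \qquad \eta_t \propto \frac{1}{\sqrt{t}} \quad \text{(assumed anytime schedule)},\\
  w_t &= \operatorname*{arg\,min}_{w \in \Delta_{K-1}}
        \Big\langle w, \textstyle\sum_{s<t} \hat{\ell}_s \Big\rangle + \Psi_t(w),\\
  \hat{\ell}_{t,i} &= \frac{\mathbb{1}\{I_t = i\}\,\big(\ell_{t,I_t} - B_t\big)}{w_{t,i}} + B_t,
\end{align*}
where $I_t$ is the arm played at round $t$ and $B_t$ is an observable baseline (e.g., $B_t = \tfrac{1}{2}$) yielding the reduced-variance loss estimator mentioned above; $\alpha = 1/2$ is the power highlighted in the abstract, and $\alpha \to 1$ recovers the negative Shannon entropy used by EXP3-style methods.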