We study the problem of designing adaptive multi-armed bandit algorithms that perform optimally in both the stochastic setting and the adversarial setting simultaneously (often known as a best-of-both-worlds guarantee). A line of recent works shows that when configured and analyzed properly, the Follow-the-Regularized-Leader (FTRL) algorithm, originally designed for the adversarial setting, can in fact optimally adapt to the stochastic setting as well. Such results, however, critically rely on an assumption that there exists one unique optimal arm. Recently, Ito (2021) took the first step to remove such an undesirable uniqueness assumption for one particular FTRL algorithm with the $\frac{1}{2}$-Tsallis entropy regularizer. In this work, we significantly improve and generalize this result, showing that uniqueness is unnecessary for FTRL with a broad family of regularizers and a new learning rate schedule. For some regularizers, our regret bounds also improve upon prior results even when uniqueness holds. We further provide an application of our results to the decoupled exploration and exploitation problem, demonstrating that our techniques are broadly applicable.
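For concreteness, the FTRL algorithm referenced above selects, at each round $t$, an arm distribution by trading off the cumulative loss estimates against a regularizer. A minimal sketch of this update, assuming the standard importance-weighted loss estimator $\widehat{\ell}_s$ and a generic round-indexed learning rate $\eta_t$ (the paper's specific regularizer family and learning rate schedule differ), is
$$
p_t \;=\; \operatorname*{argmin}_{p \in \Delta_K}\; \Big\langle p,\, \textstyle\sum_{s<t} \widehat{\ell}_s \Big\rangle \;+\; \psi_t(p),
\qquad
\psi_t(p) \;=\; -\frac{1}{\eta_t} \sum_{i=1}^{K} \sqrt{p_i}
$$
where the displayed $\psi_t$ is the $\frac{1}{2}$-Tsallis entropy regularizer studied by Ito (2021); this work replaces it with a broad family of regularizers while keeping the same FTRL template.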