An Algorithm for Stochastic and Adversarial Bandits with Switching Costs

We propose an algorithm for stochastic and adversarial multi-armed bandits with switching costs, where the algorithm pays a price $\lambda$ every time it switches the arm being played. Our algorithm is based on an adaptation of the Tsallis-INF algorithm of Zimmert and Seldin (2021) and requires no prior knowledge of the regime or the time horizon. In the oblivious adversarial setting it achieves the minimax optimal regret bound of $O\big((\lambda K)^{1/3}T^{2/3} + \sqrt{KT}\big)$, where $T$ is the time horizon and $K$ is the number of arms. In the stochastically constrained adversarial regime, which includes the stochastic regime as a special case, it achieves a regret bound of $O\left(\big((\lambda K)^{2/3} T^{1/3} + \ln T\big)\sum_{i \neq i^*} \Delta_i^{-1}\right)$, where $\Delta_i$ are the suboptimality gaps and $i^*$ is a unique optimal arm. In the special case of $\lambda = 0$ (no switching costs), both bounds are minimax optimal within constants. We also explore variants of the problem in which the switching cost is allowed to change over time. We provide an experimental evaluation showing that our algorithm is competitive with the relevant baselines in the stochastic, stochastically constrained adversarial, and adversarial regimes with a fixed switching cost.
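
To make the quantities in the abstract concrete, the following is a minimal sketch of how regret with switching costs can be evaluated for a given sequence of plays, and of the rate inside the adversarial bound with the constant hidden by $O(\cdot)$ dropped. It is not the proposed algorithm. The function names and the loss-matrix layout are hypothetical, and the sketch assumes that the first play does not count as a switch and that regret is measured against the best fixed arm in hindsight.

```python
from typing import Sequence


def count_switches(arms: Sequence[int]) -> int:
    """Number of rounds in which the played arm differs from the previous round.

    Assumption: the very first play is not counted as a switch.
    """
    return sum(1 for prev, cur in zip(arms, arms[1:]) if prev != cur)


def regret_with_switching_costs(
    losses: Sequence[Sequence[float]],  # losses[t][i]: loss of arm i in round t
    arms: Sequence[int],                # arms[t]: arm played in round t
    switch_cost: float,                 # lambda in the text
) -> float:
    """Loss of the played arms plus switching costs, minus the best fixed arm's loss."""
    num_arms = len(losses[0])
    played_loss = sum(losses[t][a] for t, a in enumerate(arms))
    best_fixed_loss = min(sum(row[i] for row in losses) for i in range(num_arms))
    return played_loss + switch_cost * count_switches(arms) - best_fixed_loss


def adversarial_bound_order(switch_cost: float, num_arms: int, horizon: int) -> float:
    """(lambda * K)^(1/3) * T^(2/3) + sqrt(K * T), ignoring the constant factor."""
    return (switch_cost * num_arms) ** (1 / 3) * horizon ** (2 / 3) + (num_arms * horizon) ** 0.5
```

For example, setting `switch_cost` to zero in `adversarial_bound_order` leaves only the $\sqrt{KT}$ term, which matches the special case $\lambda = 0$ discussed above.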
