In the multi-armed bandit framework, two formulations are commonly employed to handle time-varying reward distributions: the adversarial bandit and the nonstationary bandit. Although their oracles, algorithms, and regret analyses differ significantly, in this paper we provide a unified formulation that smoothly bridges the two as special cases. The formulation uses an oracle that takes the best fixed arm within each time window. Depending on the window size, it recovers the best-fixed-arm-in-hindsight oracle of the adversarial bandit or the dynamic oracle of the nonstationary bandit. We provide algorithms that attain the optimal regret, together with a matching lower bound.
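To make the windowed-oracle benchmark concrete, the following is a minimal sketch of how such a benchmark value could be computed offline from realized rewards. The function name `windowed_oracle_value` and the `(T, K)` reward-matrix setup are illustrative assumptions, not the paper's notation; it only shows that window size T yields the hindsight oracle of the adversarial bandit and window size 1 yields the dynamic oracle of the nonstationary bandit.

```python
import numpy as np

def windowed_oracle_value(rewards: np.ndarray, window: int) -> float:
    """Benchmark value of the best fixed arm per time window.

    `rewards` is a (T, K) array of realized per-round rewards for K
    arms (an illustrative setup; the paper's formulation may differ).
    With window == T this reduces to the best-fixed-arm-in-hindsight
    oracle of the adversarial bandit; with window == 1 it reduces to
    the dynamic oracle of the nonstationary bandit.
    """
    T = rewards.shape[0]
    total = 0.0
    for start in range(0, T, window):
        block = rewards[start:start + window]  # rewards inside one time window
        total += block.sum(axis=0).max()       # best fixed arm within this window
    return total

# Example: T = 6 rounds, K = 2 arms, with a reward shift at round 4.
rewards = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0],
                    [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
print(windowed_oracle_value(rewards, window=6))  # hindsight oracle: 3.0
print(windowed_oracle_value(rewards, window=3))  # windowed oracle:  6.0
print(windowed_oracle_value(rewards, window=1))  # dynamic oracle:   6.0
```

The example illustrates why the window size interpolates between the two benchmarks: a single window forces one arm for the whole horizon, while smaller windows let the benchmark track the reward shift, making it strictly harder to compete against.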