Despite great interest in the bandit problem, designing efficient algorithms for complex reward models remains challenging, as there is typically no analytical way to quantify uncertainty. In this paper, we propose Multiplier Bootstrap-based Exploration (MBE), a novel exploration strategy that is applicable to any reward model amenable to weighted loss minimization. We prove both instance-dependent and instance-independent rate-optimal regret bounds for MBE in sub-Gaussian multi-armed bandits. Through extensive simulation and real-data experiments, we demonstrate the generality and adaptivity of MBE.
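To make the core idea concrete, the following is a minimal sketch of multiplier bootstrap-based exploration in a multi-armed bandit: each round, every arm's empirical mean is recomputed under i.i.d. mean-one random multiplier weights, and the arm with the largest perturbed mean is pulled. The choice of Exp(1) weights, Gaussian rewards, and the one-pull-per-arm initialization are illustrative assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

def mbe_bandit(arm_means, horizon, rng=None):
    """Sketch of multiplier bootstrap-based exploration for a sub-Gaussian bandit.

    Assumptions (not from the paper): Gaussian rewards with unit variance,
    Exp(1) multiplier weights, and one initial pull per arm.
    Returns the cumulative regret over the horizon.
    """
    rng = rng or np.random.default_rng()
    k = len(arm_means)
    rewards = [[] for _ in range(k)]

    # Initialization: pull each arm once.
    for a in range(k):
        rewards[a].append(rng.normal(arm_means[a], 1.0))

    regret = 0.0
    best = max(arm_means)
    for _ in range(horizon - k):
        # Multiplier bootstrap: reweight each arm's observed rewards
        # with i.i.d. mean-one weights and take the weighted mean.
        scores = []
        for a in range(k):
            r = np.asarray(rewards[a])
            w = rng.exponential(1.0, size=r.shape)    # mean-1 multiplier weights
            scores.append(np.sum(w * r) / np.sum(w))  # perturbed empirical mean
        a = int(np.argmax(scores))                    # pull the most promising arm
        rewards[a].append(rng.normal(arm_means[a], 1.0))
        regret += best - arm_means[a]
    return regret

# Example: 3 Gaussian arms, 2000 rounds.
print(mbe_bandit([0.1, 0.5, 0.9], horizon=2000))
```

Because the perturbation comes from reweighting the loss rather than from a posterior, the same recipe extends to any reward model trained by weighted loss minimization, which is the generality the abstract refers to.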