Traditional online learning models are typically initialized from scratch. By contrast, contemporary real-world applications often have access to historical datasets that can potentially enhance the online learning process. We study how offline data can be leveraged to facilitate online learning in stochastic multi-armed bandits and combinatorial bandits. In our setting, the probability distributions that govern the offline data and the online rewards can differ. We first show that, without a non-trivial upper bound on their difference, no non-anticipatory policy can outperform the classical Upper Confidence Bound (UCB) policy, even with access to offline data. As a complement, we propose MIN-UCB, an online policy for multi-armed bandits that outperforms UCB when such an upper bound is available. MIN-UCB adaptively chooses to utilize the offline data when they are deemed informative, and to ignore them otherwise. We establish that MIN-UCB achieves tight regret bounds in both the instance-independent and instance-dependent settings. We generalize our approach to the combinatorial bandit setting by introducing MIN-COMB-UCB, and provide the corresponding instance-dependent and instance-independent regret bounds. We illustrate how various factors, such as the bias and the size of the offline dataset, affect the utility of offline data in online learning. We discuss several applications and conduct numerical experiments to validate our findings.
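To make the "adaptively utilize or ignore" behavior concrete, the following is a minimal sketch of one natural way to realize a MIN-UCB-style index: each arm keeps a standard online-only UCB index and an offline-augmented index inflated by the assumed bias bound V, and the policy uses the smaller of the two. The class name `MinUCB`, the Hoeffding-style confidence radii, and the exact form of the pooled index are illustrative assumptions, not the paper's definitions.

```python
import math


class MinUCB:
    """Sketch of a MIN-UCB-style policy: per arm, take the minimum of an
    online-only UCB index and an offline-augmented index inflated by a
    known bound V on the offline/online mean difference. Radii and index
    forms are assumed for illustration, not taken from the paper."""

    def __init__(self, n_arms, offline_sums, offline_counts, bias_bound_v, horizon):
        self.n = n_arms
        self.v = bias_bound_v                 # assumed bound on |offline mean - online mean|
        self.T = horizon
        self.off_sum = list(offline_sums)     # per-arm sums of offline rewards
        self.off_cnt = list(offline_counts)   # per-arm offline sample counts
        self.on_sum = [0.0] * n_arms
        self.on_cnt = [0] * n_arms

    def _radius(self, count):
        # Hoeffding-style radius for rewards in [0, 1]; a common choice,
        # assumed here rather than quoted from the paper.
        return math.sqrt(2.0 * math.log(self.T) / count)

    def index(self, a):
        if self.on_cnt[a] == 0:
            return float("inf")               # pull every arm once first
        # Standard online-only UCB index.
        on_mean = self.on_sum[a] / self.on_cnt[a]
        ucb_online = on_mean + self._radius(self.on_cnt[a])
        # Offline-augmented index: pooled mean, tighter radius from the
        # larger sample, plus the bias allowance V.
        pooled_cnt = self.on_cnt[a] + self.off_cnt[a]
        pooled_mean = (self.on_sum[a] + self.off_sum[a]) / pooled_cnt
        ucb_pooled = pooled_mean + self._radius(pooled_cnt) + self.v
        return min(ucb_online, ucb_pooled)

    def select(self):
        return max(range(self.n), key=self.index)

    def update(self, a, reward):
        self.on_sum[a] += reward
        self.on_cnt[a] += 1
```

Taking the minimum is what makes the choice adaptive: when V is small and the offline sample is large, the pooled index is the tighter upper bound and the offline data are effectively used; when V is large or the offline sample is scarce, the online-only index dominates and the offline data are ignored, recovering standard UCB behavior.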