Motivated by a range of applications, this paper studies transfer learning for nonparametric contextual multi-armed bandits under the covariate shift model, in which data collected on source bandits are available before learning on the target bandit begins. We establish the minimax rate of convergence for the cumulative regret and propose a novel transfer learning algorithm that attains it. The results quantify how much the data from the source domains contribute to learning in the target domain. Because adaptation to unknown smoothness is impossible in general, we also develop a data-driven algorithm that, under an additional self-similarity assumption, achieves near-optimal statistical guarantees (up to a logarithmic factor) while automatically adapting to the unknown parameters over a large collection of parameter spaces. A simulation study illustrates the benefits of using data from the auxiliary source domains for learning in the target domain.