We consider the problem of contextual multi-armed bandits in the setting of hypothesis transfer learning. That is, we assume access to a model previously learned on an unobserved set of contexts, and we leverage it to accelerate exploration on a new bandit problem. Our transfer strategy is based on a re-weighting scheme for which we show a reduction in regret over the classic Linear UCB when transfer is desired, while recovering the classic regret rate when the two tasks are unrelated. We further extend this method to an arbitrary number of source models, where the algorithm decides which model is preferred at each time step. Additionally, we discuss an approach in which a dynamic convex combination of source models enters the classic LinUCB algorithm as a biased regularization term. The algorithms and the theoretical analysis of our proposed methods are substantiated by empirical evaluations on simulated and real-world data.
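As an illustrative sketch of the biased-regularization idea (the notation below is assumed, not taken from the abstract), the estimate used by LinUCB can be recentred at a convex combination of source parameters $\theta_1,\dots,\theta_M$ with time-varying weights $w_{t,m}$:
$$
\hat{\theta}_t \;=\; \arg\min_{\theta} \sum_{s=1}^{t} \bigl(r_s - x_s^\top \theta\bigr)^2 \;+\; \lambda \Bigl\|\theta - \sum_{m=1}^{M} w_{t,m}\,\theta_m\Bigr\|_2^2, \qquad w_{t,m}\ge 0,\ \ \sum_{m=1}^{M} w_{t,m}=1 .
$$
When the weights concentrate on an unrelated source, the regularizer reduces to a standard ridge penalty up to a fixed offset, which is consistent with recovering the classic regret behaviour.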