We study the problem of online multi-task learning, where tasks are performed in similar but not necessarily identical multi-armed bandit environments. In particular, we study how a learner can improve its overall performance across multiple related tasks through robust transfer of knowledge. While an upper confidence bound (UCB)-based algorithm has recently been shown to achieve nearly optimal performance guarantees in a setting where all tasks are solved concurrently, it remains unclear whether Thompson sampling (TS) algorithms, which generally enjoy superior empirical performance, share similar theoretical properties. In this work, we present a TS-type algorithm for a more general online multi-task learning protocol that extends the concurrent setting. We provide a frequentist analysis and prove that the algorithm is also nearly optimal, using a novel concentration inequality for multi-task data aggregation at random stopping times. Finally, we evaluate the algorithm on synthetic data and show that the TS-type algorithm achieves superior empirical performance compared with the UCB-based algorithm and a baseline algorithm that performs TS for each individual task without transfer.
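For concreteness, the no-transfer baseline referenced above — independent Thompson sampling run on each task — can be sketched as follows. This is a minimal illustration for Bernoulli-reward bandits with Beta posteriors, not the paper's algorithm; all function names and parameters here are illustrative.

```python
import random

def thompson_sample_step(successes, failures, rng):
    """One TS step for a Bernoulli bandit: draw a Beta(s+1, f+1)
    sample per arm (uniform prior) and pick the arm with the largest draw."""
    draws = [rng.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

def run_bernoulli_ts(arm_means, horizon, seed=0):
    """Run per-task TS (no transfer) on a single bandit task;
    return the total reward collected over the horizon."""
    rng = random.Random(seed)
    k = len(arm_means)
    successes, failures = [0] * k, [0] * k
    total = 0
    for _ in range(horizon):
        a = thompson_sample_step(successes, failures, rng)
        r = 1 if rng.random() < arm_means[a] else 0  # Bernoulli reward
        successes[a] += r
        failures[a] += 1 - r
        total += r
    return total
```

In a multi-task experiment, this baseline would simply call `run_bernoulli_ts` once per task with fresh posteriors, discarding any cross-task information — which is exactly the data that a transfer-aware TS-type algorithm aggregates.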