We study the problem of transfer learning in the setting of stochastic linear bandit tasks. We assume that a low-dimensional linear representation is shared across the tasks, and study the benefit of learning this representation in the multi-task learning setting. Following recent results on the design of stochastic bandit policies, we propose an efficient greedy policy based on trace norm regularization. It implicitly learns a low-dimensional representation by encouraging the matrix formed by the task regression vectors to be of low rank. Unlike previous work in the literature, our policy does not need to know the rank of the underlying matrix. We derive an upper bound on the multi-task regret of our policy, which is, up to logarithmic factors, of order $\sqrt{NdT(T+d)r}$, where $T$ is the number of tasks, $r$ the rank, $d$ the number of variables, and $N$ the number of rounds per task. We show the benefit of our strategy compared to the baseline $Td\sqrt{N}$ obtained by solving each task independently. We also provide a lower bound on the multi-task regret. Finally, we corroborate our theoretical findings with preliminary experiments on synthetic data.
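For concreteness, a minimal sketch of the kind of trace-norm-regularized least-squares estimate that such a greedy policy could rely on (the design matrices $X_t$, reward vectors $y_t$, and regularization parameter $\lambda$ are illustrative notation introduced here, not taken from the abstract):
$$
\widehat{B} \;\in\; \operatorname*{arg\,min}_{B \in \mathbb{R}^{d \times T}} \; \sum_{t=1}^{T} \big\| y_t - X_t B_{:,t} \big\|_2^2 \;+\; \lambda \, \| B \|_{*},
$$
where $\|B\|_{*}$ denotes the trace (nuclear) norm of $B$, i.e. the sum of its singular values. Penalizing this norm encourages the matrix whose columns are the task regression vectors to have low rank, without requiring the rank $r$ to be known; a greedy policy would then, in each task $t$, play the action with the highest estimated reward under the corresponding column $\widehat{B}_{:,t}$.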