To leverage the power of big data from source tasks and overcome the scarcity of target task samples, representation learning based on multi-task pretraining has become a standard approach in many applications. However, up until now, choosing which source tasks to include in the multi-task learning has been more art than science. In this paper, we give the first formal study of source task sampling by leveraging techniques from active learning. We propose an algorithm that iteratively estimates the relevance of each source task to the target task and samples from each source task based on the estimated relevance. Theoretically, we show that for the linear representation class, to achieve the same error rate, our algorithm can save up to a \textit{number of source tasks} factor in the source task sample complexity, compared with naive uniform sampling from all source tasks. We also provide experiments on real-world computer vision datasets to illustrate the effectiveness of our proposed method on both linear and convolutional neural network representation classes. We believe our paper serves as an important initial step in bringing techniques from active learning to representation learning.