Leveraging the wealth of unlabeled data produced in recent years provides great potential for improving supervised models. When the cost of acquiring labels is high, probabilistic active learning methods can be used to greedily select the most informative data points to be labeled. However, for many large-scale problems, standard greedy procedures become computationally infeasible and suffer from negligible model change. In this paper, we introduce a novel Bayesian batch active learning approach that mitigates these issues. Our approach is motivated by approximating the complete data posterior of the model parameters. While naive batch construction methods result in correlated queries, our algorithm produces diverse batches that enable efficient active learning at scale. We derive interpretable closed-form solutions akin to existing active learning procedures for linear models, and generalize to arbitrary models using random projections. We demonstrate the benefits of our approach on several large-scale regression and classification tasks.
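To make the batch-construction idea concrete, here is a minimal sketch, not the paper's exact algorithm: suppose each pool point is summarized by a feature vector (e.g., a random projection of its expected log-likelihood under the current posterior; the array `L` and the helper `frank_wolfe_batch` below are hypothetical stand-ins). A Frank-Wolfe-style sparse subset approximation then selects a small weighted batch whose feature sum matches the full-pool sum, which encourages diverse rather than correlated queries.

```python
import numpy as np

def frank_wolfe_batch(L, batch_size):
    """Sketch of sparse subset approximation via Frank-Wolfe.

    Selects a weighted batch whose feature sum approximates the
    full-pool sum P = sum_n L[n].

    L : (N, D) array of per-point feature vectors (assumed given,
        e.g. random projections of expected log-likelihoods).
    Returns the indices of the selected points and their weights.
    """
    N, _ = L.shape
    sigma_n = np.maximum(np.linalg.norm(L, axis=1), 1e-12)  # per-point norms
    sigma = sigma_n.sum()                                   # polytope scale
    P = L.sum(axis=0)                                       # full-pool target
    w = np.zeros(N)                                         # sparse batch weights
    for _ in range(batch_size):
        r = P - L.T @ w                      # residual of the approximation
        # Linear minimization over the vertices (sigma / sigma_n) * e_n:
        # pick the point with the largest scaled correlation with r.
        n_star = int(np.argmax((L @ r) / sigma_n))
        vertex = np.zeros(N)
        vertex[n_star] = sigma / sigma_n[n_star]
        # Exact line search for the quadratic objective ||P - L^T w||^2.
        d = L.T @ (vertex - w)
        gamma = np.clip((r @ d) / (d @ d), 0.0, 1.0)
        w = (1.0 - gamma) * w + gamma * vertex
    idx = np.nonzero(w)[0]
    return idx, w[idx]

# Toy usage: 1000 pool points, 64-dimensional projections, batch of 10.
rng = np.random.default_rng(0)
L = rng.normal(size=(1000, 64))
idx, weights = frank_wolfe_batch(L, batch_size=10)
print(idx, weights)
```

Because each Frank-Wolfe iteration activates at most one new vertex, the weight vector stays sparse, so `batch_size` iterations yield a batch of at most that many points.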