Given restrictions on the availability of labeled data, active learning is the process of training a model with limited labeled data by selecting a core subset of an unlabeled data pool to label. Although selecting the most useful points for training is an optimization problem, the scale of deep learning data sets forces most selection strategies to employ efficient heuristics. Instead, we propose a new integer optimization problem for selecting a core set that minimizes the discrete Wasserstein distance from the unlabeled pool. We demonstrate that this problem can be tractably solved with a Generalized Benders Decomposition algorithm. Our strategy requires high-quality latent features, which we obtain by unsupervised learning on the unlabeled pool. Numerical results on several data sets show that our optimization approach is competitive with baselines and, in particular, outperforms them in the low-budget regime, where less than one percent of the data set is labeled.
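To make the core-set objective concrete, the following is a minimal sketch of selecting b of n points so that the uniform distribution over the selection is close, in discrete Wasserstein distance, to the uniform distribution over the full pool. This is not the paper's exact formulation or its Generalized Benders Decomposition solver: it solves one monolithic mixed-integer program with PuLP/CBC, uses toy synthetic features in place of learned latent features, and the pool size n, budget b, and Euclidean ground cost are illustrative assumptions that only scale to tiny instances.

```python
import numpy as np
import pulp

rng = np.random.default_rng(0)
n, b = 30, 3                      # toy pool size and labeling budget (assumptions)
feats = rng.normal(size=(n, 2))   # stand-in for latent features from unsupervised learning
# Euclidean ground costs between all pairs of points
cost = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)

prob = pulp.LpProblem("wasserstein_coreset", pulp.LpMinimize)
y = pulp.LpVariable.dicts("y", range(n), cat=pulp.LpBinary)       # 1 iff point j is selected
x = pulp.LpVariable.dicts("x", (range(n), range(n)), lowBound=0)  # transport plan

# Objective: total transport cost, i.e. the discrete Wasserstein distance
# between the pool distribution and the distribution on the selected core set
prob += pulp.lpSum(float(cost[i, j]) * x[i][j]
                   for i in range(n) for j in range(n))
# Each pool point i carries mass 1/n and must be fully transported
for i in range(n):
    prob += pulp.lpSum(x[i][j] for j in range(n)) == 1.0 / n
# A selected point j receives mass 1/b; an unselected point receives none
for j in range(n):
    prob += pulp.lpSum(x[i][j] for i in range(n)) == y[j] * (1.0 / b)
# Exactly b points are labeled
prob += pulp.lpSum(y[j] for j in range(n)) == b

prob.solve(pulp.PULP_CBC_CMD(msg=False))
core_set = [j for j in range(n) if y[j].value() > 0.5]
print("selected indices:", core_set, "objective:", pulp.value(prob.objective))
```

For fixed binary selection variables y, the remaining problem in x is an ordinary transportation linear program; that separable structure is what a Generalized Benders Decomposition exploits to avoid solving the coupled integer program directly at deep learning scale.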