It is widely believed that given the same labeling budget, active learning algorithms like uncertainty sampling achieve better predictive performance than passive learning (i.e. uniform sampling), albeit at a higher computational cost. Recent empirical evidence suggests that this added cost might be in vain, as uncertainty sampling can sometimes perform even worse than passive learning. While existing works offer different explanations in the low-dimensional regime, this paper shows that the underlying mechanism is entirely different in high dimensions: we prove for logistic regression that passive learning outperforms uncertainty sampling even for noiseless data and when using the uncertainty of the Bayes optimal classifier. Insights from our proof indicate that this high-dimensional phenomenon is exacerbated when the separation between the classes is small. We corroborate this intuition with experiments on 20 high-dimensional datasets spanning a diverse range of applications, from finance and histology to chemistry and computer vision.
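To make the comparison concrete, here is a minimal, self-contained toy sketch (not the paper's experimental setup) contrasting uncertainty sampling with passive uniform sampling for logistic regression; the synthetic data, seed sizes, and training routine are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noiseless data: labels come from a fixed "Bayes optimal" direction.
d, n_pool = 2, 200
w_star = rng.normal(size=d)
X = rng.normal(size=(n_pool, d))
y = (X @ w_star > 0).astype(int)

def fit_logistic(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression (no regularization)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def uncertainty_query(w, X, labeled):
    """Pick the unlabeled point whose predicted probability is closest to 0.5."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    margin = np.abs(p - 0.5)
    margin[list(labeled)] = np.inf  # exclude already-labeled points
    return int(np.argmin(margin))

# Active loop: seed with a few random labels, then query by uncertainty.
labeled = set(rng.choice(n_pool, size=5, replace=False).tolist())
for _ in range(20):
    w = fit_logistic(X[list(labeled)], y[list(labeled)])
    labeled.add(uncertainty_query(w, X, labeled))

# Passive baseline: same total labeling budget, sampled uniformly.
passive = rng.choice(n_pool, size=len(labeled), replace=False)
w_active = fit_logistic(X[list(labeled)], y[list(labeled)])
w_passive = fit_logistic(X[passive], y[passive])

acc = lambda w: float(((X @ w > 0) == (y == 1)).mean())
print(f"active accuracy:  {acc(w_active):.2f}")
print(f"passive accuracy: {acc(w_passive):.2f}")
```

In this low-dimensional toy both strategies do well; the paper's claim is that in high dimensions (d large relative to the budget) and with small class separation, the uncertainty-driven queries can systematically underperform the uniform baseline.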