We consider active learning for binary classification in the agnostic pool-based setting. The vast majority of work on active learning in the agnostic setting is inspired by the CAL algorithm, in which each query is sampled uniformly from the disagreement region of the current version space. The sample complexity of such algorithms is described by a quantity known as the disagreement coefficient, which captures both the geometry of the hypothesis space and the underlying probability space. To date, the disagreement coefficient has been justified only by minimax lower bounds, leaving the door open for superior instance-dependent sample complexities. In this work we propose an algorithm that, in contrast to uniform sampling over the disagreement region, solves an experimental design problem to determine a distribution over examples from which to request labels. We show that the new approach achieves sample complexity bounds that are never worse than the best disagreement-coefficient-based bounds, but in specific cases can be dramatically smaller. From a practical perspective, the proposed algorithm requires no hyperparameters to tune (e.g., to control the aggressiveness of sampling) and is computationally efficient, assuming access to an unconstrained empirical risk minimization oracle. Empirically, we demonstrate that our algorithm is superior to state-of-the-art agnostic active learning algorithms on image classification datasets.
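To make the contrast concrete, below is a minimal, self-contained sketch, not the paper's algorithm: it compares a CAL-style query drawn uniformly from the disagreement region against a query drawn from a distribution obtained by optimizing a toy experimental design objective. The 1-D threshold class, the pairwise variance-style criterion, and the Frank-Wolfe-like mixing update are illustrative assumptions, not the objective or update rule from this work.

import numpy as np

rng = np.random.default_rng(0)

# Toy pool: 200 points on [0, 1]; hypothesis class: 1-D threshold classifiers.
X = np.sort(rng.uniform(0.0, 1.0, size=200))
thresholds = np.linspace(0.0, 1.0, 21)
# H[h, i] = prediction in {-1, +1} of hypothesis h on pool point i.
H = np.where(X[None, :] >= thresholds[:, None], 1, -1)

# Disagreement region of the current version space (here: the whole class).
disagree_idx = np.flatnonzero(H.max(axis=0) != H.min(axis=0))

# CAL-style query: sample uniformly from the disagreement region.
cal_query = rng.choice(disagree_idx)

# Experimental-design-style query distribution (illustrative objective only):
# choose a distribution lam over the pool to reduce the worst-case quantity
#   max_{h, h'}  sum_i 1{h(x_i) != h'(x_i)} / lam_i,
# a variance-style criterion, via a Frank-Wolfe-like mixing update.
pairs = [(a, b) for a in range(len(thresholds))
         for b in range(a + 1, len(thresholds))]
lam = np.full(X.size, 1.0 / X.size)
for _ in range(200):
    # Find the hypothesis pair that is worst under the current distribution.
    a, b = max(pairs, key=lambda p: ((H[p[0]] != H[p[1]]) / lam).sum())
    d = (H[a] != H[b]).astype(float)
    # For a fixed pair, sum_i d_i / lam_i is minimized by lam_i proportional
    # to sqrt(d_i); mix a small step of that minimizer into the distribution.
    lam = 0.95 * lam + 0.05 * np.sqrt(d) / np.sqrt(d).sum()
lam /= lam.sum()

# Query drawn from the optimized design rather than uniformly.
design_query = rng.choice(X.size, p=lam)
print("CAL query index:", cal_query, "| design query index:", design_query)

Qualitatively, the optimized distribution concentrates queries where hypothesis pairs disagree most relative to their current sampling mass, which is the behavior the abstract contrasts with uniform sampling over the disagreement region.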