Annotating the right set of data among all available data points is a key challenge in many machine learning applications. Batch active learning is a popular approach to address this: batches of unlabeled data points are selected for annotation, and the underlying learning algorithm is updated after each batch. Increasingly large batches are particularly appealing in settings where data can be annotated in parallel and model training is computationally expensive. A key challenge here is scale: typical active learning methods rely on diversity techniques, which select a diverse set of data points to annotate from an unlabeled pool. In this work, we introduce Active Data Shapley (ADS), a filtering layer for batch active learning that significantly increases its efficiency by pre-selecting, in a linear-time computation, the highest-value points from an unlabeled dataset. Using the notion of the Shapley value of data, our method estimates the value of unlabeled data points with regard to the prediction task at hand. We show that ADS is particularly effective when the pool of unlabeled data exhibits real-world caveats: noise, heterogeneity, and domain shift. We run experiments demonstrating that when ADS is used to pre-select the highest-ranking portion of an unlabeled dataset, the efficiency of state-of-the-art batch active learning methods increases by an average factor of 6, while preserving performance.
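The pre-selection step described above can be illustrated with a minimal sketch. It assumes that a value estimate is already available for every unlabeled point; how ADS computes those estimates is not specified here, so the scores below are made-up numbers, and the function name `preselect_top_fraction` and its `fraction` parameter are hypothetical. The sketch only shows the filtering idea: keep the highest-ranking portion of the pool and hand that smaller pool to a batch active learning method.

```python
import heapq
import math
from typing import List, Sequence


def preselect_top_fraction(values: Sequence[float], fraction: float) -> List[int]:
    """Return indices of the highest-valued points, best first.

    `values` holds one estimated value per unlabeled point (assumed given);
    `fraction` is the share of the pool to keep, in (0, 1].
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if len(values) == 0:
        return []
    # Keep at least one point, rounding the pool share up.
    k = max(1, math.ceil(fraction * len(values)))
    # nlargest returns the k indices with the largest values, in descending order.
    return heapq.nlargest(k, range(len(values)), key=lambda i: values[i])


# Made-up value estimates for a pool of six unlabeled points.
estimated_values = [0.1, 0.9, -0.3, 0.5, 0.7, 0.0]
kept = preselect_top_fraction(estimated_values, 0.5)
print(kept)  # [1, 4, 3]
```

A downstream diversity-based selector would then draw its batch only from the points indexed by `kept`, so its cost depends on the size of the filtered pool rather than the full one.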