高效批量积极学习数据模型估价 (Data Shapley Valuation for Efficient Batch Active Learning)

Annotating the right set of data amongst all available data points is a key challenge in many machine learning applications. Batch active learning is a popular approach to address this, in which batches of unlabeled data points are selected for annotation, while an underlying learning algorithm gets subsequently updated. Increasingly larger batches are particularly appealing in settings where data can be annotated in parallel, and model training is computationally expensive. A key challenge here is scale - typical active learning methods rely on diversity techniques, which select a diverse set of data points to annotate, from an unlabeled pool. In this work, we introduce Active Data Shapley (ADS) -- a filtering layer for batch active learning that significantly increases the efficiency of active learning by pre-selecting, using a linear time computation, the highest-value points from an unlabeled dataset. Using the notion of the Shapley value of data, our method estimates the value of unlabeled data points with regards to the prediction task at hand. We show that ADS is particularly effective when the pool of unlabeled data exhibits real-world caveats: noise, heterogeneity, and domain shift. We run experiments demonstrating that when ADS is used to pre-select the highest-ranking portion of an unlabeled dataset, the efficiency of state-of-the-art batch active learning methods increases by an average factor of 6x, while preserving performance effectiveness.

翻译：在所有可用数据点中注明正确的数据集是许多机器学习应用程序中的一个关键挑战。批量积极学习是解决这一问题的流行方法, 即选择一批未贴标签的数据点进行批量注解, 并随后更新基本的学习算法。越来越多的批量在数据可以同时附加注释且模型培训计算成本高昂的环境下特别吸引。这里的一个关键挑战是规模 - 典型的主动学习方法依赖于多样性技术, 这些技术从一个未贴标签的集合中选择一组不同的数据点进行批量。在这项工作中, 我们引入了“ 活跃数据洞穴( ADS) ” -- -- 一个用于批量积极学习的过滤层, 通过使用直线时间计算, 使未贴标签的数据集中的最大值点显著提高积极学习的效率。我们的方法估算了未贴标签的数据点对于当前预测任务的价值。我们显示, 当未贴标签的数据集合显示真实世界的洞穴时, ADS 将特别有效: 噪音, 保存性能率的过滤层, 使用高级数据序列, 将数据序列演示前的磁带。

相关内容

主动学习

关注 241

主动学习是机器学习（更普遍的说是人工智能）的一个子领域，在统计学领域也叫查询学习、最优实验设计。“学习模块”和“选择策略”是主动学习算法的2个基本且重要的模块。主动学习是“一种学习方法，在这种方法中，学生会主动或体验性地参与学习过程，并且根据学生的参与程度，有不同程度的主动学习。” （Bonwell＆Eison 1991）Bonwell＆Eison（1991）指出：“学生除了被动地听课以外，还从事其他活动。” 在高等教育研究协会（ASHE）的一份报告中，作者讨论了各种促进主动学习的方法。他们引用了一些文献，这些文献表明学生不仅要做听，还必须做更多的事情才能学习。他们必须阅读，写作，讨论并参与解决问题。此过程涉及三个学习领域，即知识，技能和态度（KSA）。这种学习行为分类法可以被认为是“学习过程的目标”。特别是，学生必须从事诸如分析，综合和评估之类的高级思维任务。

专知会员服务

39+阅读 · 2020年11月3日

不可错过！UIUC最新《统计强化学习》课程！

专知会员服务

54+阅读 · 2020年9月7日

【WSDM2020】小数据学习，124页ppt，Learning with Small Data，宾夕法尼亚州立大学

专知会员服务

137+阅读 · 2020年2月6日

【IJCAI 2019 | tutorial】大数据中的小数据挑战Small Data Challenges in Big Data Era ，华为|Guo-Jun Qi，柯达|Jiebo Luo

专知会员服务

30+阅读 · 2019年11月30日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation