Active learning is a widely used and powerful framework for iteratively and adaptively sampling subsets of an unlabeled set with a human in the loop, with the goal of labeling efficiency. Most real-world datasets are imbalanced in either classes or slices, so parts of the dataset are rare. As a result, much work has gone into designing active learning approaches for mining these rare data instances. Most approaches assume access to a seed set that contains these rare instances. Under more extreme rareness, however, it is reasonable to assume that the rare instances (either classes or slices) may not be present in the seed labeled set at all, and a critical need for the active learning paradigm is to discover them efficiently. In this work, we provide an active data discovery framework that efficiently mines unknown data slices and classes using the submodular conditional gain and submodular conditional mutual information functions. We provide a general algorithmic framework that works in a number of scenarios, including image classification and object detection, and handles both rare classes and rare slices in the unlabeled set. We show significant accuracy and labeling efficiency gains with our approach compared to existing state-of-the-art active learning approaches for actively discovering these rare classes and slices.
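To make the two functions named above concrete, here is a minimal sketch using their standard definitions for a submodular function f: the conditional gain f(A | P) = f(A ∪ P) − f(P) and the conditional mutual information I_f(A; Q | P) = f(A ∪ P) + f(Q ∪ P) − f(A ∪ Q ∪ P) − f(P). The choice of a facility location function for f, the RBF similarity, the toy points, and every function name below are assumptions made for illustration, not the implementation used in this work.

```python
import numpy as np

def facility_location(sim, subset):
    """f(A) = sum over all points i of max_{j in A} sim[i, j]; f(empty) = 0."""
    cols = sorted(set(subset))
    if not cols:
        return 0.0
    return float(sim[:, cols].max(axis=1).sum())

def conditional_gain(sim, a, p):
    """f(A | P) = f(A union P) - f(P)."""
    return facility_location(sim, set(a) | set(p)) - facility_location(sim, p)

def conditional_mutual_information(sim, a, q, p):
    """I_f(A; Q | P) = f(A u P) + f(Q u P) - f(A u Q u P) - f(P)."""
    a, q, p = set(a), set(q), set(p)
    return (facility_location(sim, a | p) + facility_location(sim, q | p)
            - facility_location(sim, a | q | p) - facility_location(sim, p))

def greedy_select(sim, candidates, known, budget):
    """Greedily pick up to `budget` candidates that maximize f(A | known)."""
    selected, remaining = [], sorted(candidates)
    for _ in range(min(budget, len(remaining))):
        best = max(remaining,
                   key=lambda j: conditional_gain(sim, selected + [j], known))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy data: points 0-3 form a "known" cluster near the origin,
# points 4-6 form an unseen cluster far away.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [4.9, 5.0]])
sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
sim = np.exp(-sq_dists)  # RBF similarity, an assumed choice

known = [0, 1]               # indices already in the labeled seed set
unlabeled = [2, 3, 4, 5, 6]  # indices available for selection
picked = greedy_select(sim, unlabeled, known, budget=1)
print(picked)
```

In this toy setup, conditioning on the known points makes candidates near the origin nearly worthless, so the greedy step picks a point from the far cluster. Once such a point is found, the conditional mutual information with that point as the query set Q scores other candidates by how similar they are to Q while still differing from the known set P.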