We study unsupervised data selection for semi-supervised learning (SSL), where a large-scale unlabeled dataset is available and a small subset of it is budgeted for label acquisition. Existing SSL methods focus on learning a model that effectively integrates information from the given small labeled set and the large unlabeled set, whereas we focus on selecting the right data to annotate for SSL, without requiring any label or task information. Intuitively, the instances to be labeled should collectively provide maximum diversity and coverage for downstream tasks, and individually carry maximum information-propagation utility for SSL. We formalize these concepts in a three-step data-centric SSL method that improves FixMatch in stability and accuracy by 8% on CIFAR-10 (0.08% labeled) and 14% on ImageNet-1K (0.2% labeled). It is also a universal framework that works with various SSL methods, delivering consistent performance gains. Our work demonstrates that a small amount of computation spent on carefully selecting the data to annotate brings large gains in annotation efficiency and model performance without changing the learning pipeline. Our completely unsupervised data selection can be easily extended to other weakly supervised learning settings.
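To make the two selection criteria concrete, below is a minimal sketch in Python, not the paper's exact three-step algorithm: k-means clustering over self-supervised features stands in for collective diversity and coverage (one cluster per budgeted label), and a k-NN density score stands in for individual information-propagation utility (a label in a dense region propagates to many similar unlabeled points). The function name `select_for_labeling`, the feature array, and all hyperparameters are illustrative assumptions, not identifiers from the paper.

```python
# Illustrative sketch of unsupervised data selection for SSL.
# Assumes `features` are self-supervised embeddings of the unlabeled pool.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def select_for_labeling(features: np.ndarray, budget: int, k: int = 10) -> np.ndarray:
    """Return indices of `budget` instances to annotate.

    features: (N, D) array of unsupervised embeddings.
    budget:   number of labels we can afford.
    k:        neighborhood size for the density proxy.
    """
    # Collective diversity/coverage: partition the pool into `budget` clusters,
    # so each acquired label covers a distinct region of feature space.
    cluster_ids = KMeans(n_clusters=budget, n_init=10).fit_predict(features)

    # Individual utility proxy: k-NN density. A small mean distance to the
    # k nearest neighbors marks a dense region, where a single label can
    # propagate to many similar unlabeled instances during SSL.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    dist, _ = nn.kneighbors(features)
    density = -dist[:, 1:].mean(axis=1)  # column 0 is the point itself

    # Pick the densest instance within each cluster.
    selected = []
    for c in range(budget):
        members = np.where(cluster_ids == c)[0]
        selected.append(members[np.argmax(density[members])])
    return np.array(selected)
```

In this sketch, choosing the densest point per cluster (rather than the centroid-nearest one) is one plausible way to combine the two criteria: the clustering guarantees coverage of the pool, while the density score favors instances whose labels the SSL learner can spread widely via pseudo-labeling.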