We study unsupervised data selection for semi-supervised learning (SSL), where large-scale unlabeled data is available and a small subset is budgeted for label acquisition. Existing SSL methods focus on learning a model that effectively integrates information from the given small labeled set and the large unlabeled set, whereas we focus on selecting the right data for SSL without any label or task information, in stark contrast also to supervised data selection for active learning. Intuitively, the instances to be labeled should collectively have maximum diversity and coverage for downstream tasks, and individually have maximum information propagation utility for SSL. We formalize these concepts in a three-step data-centric SSL method that improves FixMatch in stability and accuracy by 8% on CIFAR-10 (0.08% labeled) and 14% on ImageNet-1K (0.2% labeled). Our work demonstrates that a small amount of compute spent on careful labeled-data selection brings large gains in annotation efficiency and model performance without changing the learning pipeline. Our completely unsupervised data selection can be easily extended to other weakly supervised learning settings.
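To make the selection criteria concrete, here is a minimal sketch (not the paper's exact algorithm) of unsupervised labeled-set selection: cluster unsupervised feature embeddings to enforce diversity and coverage, then within each cluster pick a density peak as a proxy for information propagation utility under SSL. The function name, the k-NN density estimate, and the hyperparameters are illustrative assumptions; `features` is assumed to be an (N, D) array of self-supervised embeddings.

```python
# Hypothetical sketch of unsupervised data selection for SSL label budgets.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def select_for_labeling(features: np.ndarray, budget: int, knn: int = 20) -> np.ndarray:
    """Return indices of `budget` instances to send for annotation."""
    # Diversity/coverage: one cluster per labeling slot.
    cluster_ids = KMeans(n_clusters=budget, n_init=10, random_state=0).fit_predict(features)

    # Utility proxy: within each cluster, prefer the point whose k nearest
    # neighbors are closest (a density peak propagates labels well under SSL).
    nn = NearestNeighbors(n_neighbors=knn + 1).fit(features)
    dists, _ = nn.kneighbors(features)
    density = 1.0 / (dists[:, 1:].mean(axis=1) + 1e-12)  # drop self-distance

    selected = []
    for c in range(budget):
        members = np.where(cluster_ids == c)[0]
        selected.append(members[np.argmax(density[members])])
    return np.array(selected)

if __name__ == "__main__":
    # Toy usage: pick 40 instances (roughly 0.08% of CIFAR-10's 50k) from random features.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(5000, 128)).astype(np.float32)
    idx = select_for_labeling(feats, budget=40)
    print(idx.shape)  # (40,)
```

The selected indices would then be sent for annotation and used as the labeled set for an unchanged SSL pipeline such as FixMatch.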