Given an unlabeled dataset and an annotation budget, we study how to selectively label a fixed number of instances so that semi-supervised learning (SSL) on the resulting partially labeled dataset is most effective. We focus on selecting the right data to label, in addition to the usual SSL task of propagating labels from the labeled data to the remaining unlabeled data. This instance selection task is challenging because, without any labeled data, we do not know what the learning objective should be. Intuitively, no matter what the downstream task is, the instances to be labeled must be representative and diverse: the former facilitates label propagation to unlabeled data, whereas the latter ensures coverage of the entire dataset. We capture this idea by selecting cluster prototypes, either in a pretrained feature space or jointly with feature optimization, in both cases without labels. Our unsupervised selective labeling consistently improves SSL methods over state-of-the-art active learning that is given labeled data, by 8 to 25 times in label efficiency. For example, it boosts FixMatch accuracy by 10% on CIFAR-10 with 0.08% labeled data and by 14% on ImageNet-1K with 0.2% labeled data, demonstrating that a small amount of computation spent on selecting what data to label brings significant gains, especially under a low annotation budget. Our work sets a new standard for practical and efficient SSL.
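To make the idea of selecting cluster prototypes in a fixed feature space concrete, the following is a minimal sketch, not the method evaluated in this work. It assumes the features have already been extracted by some pretrained encoder, and it substitutes plain k-means with farthest-first initialization for whatever clustering procedure is actually used. The names `select_prototypes`, `features`, and `budget` are hypothetical and introduced only for illustration.

```python
import numpy as np


def select_prototypes(features, budget, n_iters=50, seed=0):
    """Return indices of up to `budget` rows of `features` to annotate.

    Hypothetical helper: clusters the features with plain k-means
    (k = budget) and picks, per cluster, the member nearest the centroid.
    """
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=np.float64)
    n = features.shape[0]
    if not 1 <= budget <= n:
        raise ValueError("budget must be between 1 and the number of instances")

    # Farthest-first initialization: spread the starting centroids out
    # so that the initial picks already cover the dataset (diversity).
    centroid_idx = [int(rng.integers(n))]
    min_d = ((features - features[centroid_idx[0]]) ** 2).sum(axis=1)
    for _ in range(budget - 1):
        nxt = int(min_d.argmax())
        centroid_idx.append(nxt)
        min_d = np.minimum(min_d, ((features - features[nxt]) ** 2).sum(axis=1))
    centroids = features[centroid_idx].copy()

    for _ in range(n_iters):
        # Squared Euclidean distance from every instance to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for k in range(budget):
            members = features[assign == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment against the last centroids.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)

    selected = []
    for k in range(budget):
        member_idx = np.flatnonzero(assign == k)
        if member_idx.size == 0:
            continue  # an empty cluster contributes no prototype
        # Representative instance: the cluster member closest to the centroid.
        selected.append(int(member_idx[dists[member_idx, k].argmin()]))
    return selected


# Toy usage: three well-separated 2-D blobs of 20 points each, budget of 3.
demo_rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
demo_features = np.concatenate(
    [c + 0.1 * demo_rng.standard_normal((20, 2)) for c in centers]
)
demo_selected = select_prototypes(demo_features, budget=3)
print(demo_selected)  # one index from each blob of 20 rows
```

In this sketch, picking one instance per cluster is what gives diversity, and taking the member nearest the centroid is what gives representativeness. The dense distance matrix is only suitable for toy inputs; a dataset at the scale of ImageNet-1K would need a batched or approximate clustering routine.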