Deep learning in general domains has continually been extended to domain-specific tasks that require recognizing fine-grained characteristics. However, real-world fine-grained applications suffer from two challenges: a heavy reliance on expert knowledge for annotation, and the need for a versatile model that can serve various downstream tasks within a specific domain (e.g., predicting categories, bounding boxes, or pixel-wise annotations). Fortunately, recent self-supervised learning (SSL) is a promising approach for pretraining a model without annotations, serving as an effective initialization for any downstream task. Since SSL does not rely on annotations, it typically exploits a large-scale unlabeled dataset, referred to as an open-set. Accordingly, we introduce a novel Open-Set Self-Supervised Learning problem, under the assumption that a large-scale unlabeled open-set is available alongside the fine-grained target dataset during the pretraining phase. In this setup, it is crucial to account for the distribution mismatch between the open-set and the target dataset. Hence, we propose the SimCore algorithm, which samples a coreset: the subset of the open-set with minimum distance to the target dataset in the latent space. We demonstrate that SimCore significantly improves representation learning performance across extensive experimental settings, including eleven fine-grained datasets and seven open-sets, on various downstream tasks.
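To make the coreset idea concrete, here is a minimal sketch of distance-based coreset sampling, not the authors' official implementation: embed both datasets with some encoder, score each open-set sample by its similarity to its nearest target-set neighbor, and keep the closest ones. The function name `sample_coreset`, the `budget` parameter, and the use of a plain nearest-neighbor score are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sample_coreset(target_feats: torch.Tensor,
                   openset_feats: torch.Tensor,
                   budget: int) -> torch.Tensor:
    """Select the `budget` open-set samples whose latent features lie
    closest to the target dataset (a simplified nearest-neighbor proxy
    for a distance-minimizing coreset; hypothetical helper).

    target_feats:  (n_t, d) embeddings of the fine-grained target set.
    openset_feats: (n_o, d) embeddings of the unlabeled open-set.
    Returns indices of the selected open-set samples.
    """
    # Cosine similarity between every open-set and target embedding.
    t = F.normalize(target_feats, dim=1)
    o = F.normalize(openset_feats, dim=1)
    sim = o @ t.T                      # (n_o, n_t)

    # Score each open-set sample by its closest target neighbor,
    # then keep the `budget` highest-scoring samples.
    nearest_sim, _ = sim.max(dim=1)    # (n_o,)
    return nearest_sim.topk(budget).indices
```

In a full pipeline, the embeddings might come from an encoder briefly pretrained on the target set alone; the selected subset would then be merged with the target data for the subsequent SSL pretraining phase, mitigating the open-set/target distribution mismatch described above.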