This paper is concerned with self-supervised learning for small models. The problem is motivated by our empirical studies showing that, while the widely used contrastive self-supervised learning method has made great progress on large model training, it does not work well for small models. To address this problem, we propose a new learning paradigm, named SElf-SupErvised Distillation (SEED), in which we leverage a larger network (as Teacher) to transfer its representational knowledge into a smaller architecture (as Student) in a self-supervised fashion. Instead of directly learning from unlabeled data, we train a student encoder to mimic the similarity score distribution inferred by a teacher over a set of instances. We show that SEED dramatically boosts the performance of small networks on downstream tasks. Compared with self-supervised baselines, SEED improves the top-1 accuracy from 42.2% to 67.6% on EfficientNet-B0 and from 36.3% to 68.2% on MobileNet-v3-Large on the ImageNet-1k dataset.
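The core training signal can be illustrated with a minimal sketch, assuming a PyTorch setting: the teacher scores each image against a maintained set of instance embeddings, and the student is trained to reproduce the teacher's resulting softmax distribution. The function name seed_distillation_loss, the queue variable, and the temperature value below are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def seed_distillation_loss(student_emb: torch.Tensor,
                           teacher_emb: torch.Tensor,
                           queue: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Cross-entropy between the teacher's and student's similarity
    distributions over a shared set of instance embeddings.

    student_emb: (B, D) student features for the current batch
    teacher_emb: (B, D) teacher features for the same images (no gradient)
    queue:       (K, D) instance embeddings maintained by the teacher
    """
    # L2-normalize so dot products are cosine similarities.
    s = F.normalize(student_emb, dim=1)
    t = F.normalize(teacher_emb, dim=1).detach()
    q = F.normalize(queue, dim=1).detach()

    # Similarity of each image to every instance in the set, softened by a temperature.
    teacher_logits = t @ q.t() / temperature   # (B, K)
    student_logits = s @ q.t() / temperature   # (B, K)

    # The teacher produces soft targets; the student is trained to match them.
    p_teacher = F.softmax(teacher_logits, dim=1)
    log_p_student = F.log_softmax(student_logits, dim=1)
    return -(p_teacher * log_p_student).sum(dim=1).mean()

if __name__ == "__main__":
    # Toy usage with random features: batch of 8, 128-d embeddings, 4096 instances.
    student = torch.randn(8, 128, requires_grad=True)
    teacher = torch.randn(8, 128)
    queue = torch.randn(4096, 128)
    loss = seed_distillation_loss(student, teacher, queue)
    loss.backward()  # gradients flow only into the student features
    print(f"distillation loss: {loss.item():.4f}")
```

In this sketch only the student receives gradients; the teacher and the instance set act purely as fixed targets, which is what distinguishes the distillation objective from directly learning a contrastive loss on unlabeled data.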