Self-supervised learning solves pretext prediction tasks that do not require annotations in order to learn feature representations. For vision tasks, pretext tasks such as predicting rotation or solving jigsaw puzzles are created solely from the input data. Yet, predicting this known information helps in learning representations useful for downstream tasks. However, recent works have shown that wider and deeper models benefit more from self-supervised learning than smaller models do. To address the issue of self-supervised pre-training of smaller models, we propose Distill-on-the-Go (DoGo), a self-supervised learning paradigm that uses single-stage online knowledge distillation to improve the representation quality of smaller models. We employ a deep mutual learning strategy in which two models collaboratively learn from each other to improve one another. Specifically, each model is trained using self-supervised learning along with a distillation objective that aligns each model's softmax probabilities over similarity scores with those of its peer model. We conduct extensive experiments on multiple benchmark datasets, learning objectives, and architectures to demonstrate the potential of our proposed method. Our results show significant performance gains in the presence of noisy and limited labels, as well as improved generalization to out-of-distribution data.
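To make the training objective concrete, the following is a minimal sketch of one mutual-learning step under assumed SimCLR-style contrastive pre-training; the function names, temperatures (`tau`, `tau_kd`), and weighting `alpha` are illustrative assumptions, not the exact formulation or hyper-parameters from the paper.

```python
# Hypothetical sketch of a DoGo-style mutual-learning step (assumptions noted above).
import torch
import torch.nn.functional as F


def similarity_scores(z1, z2, temperature):
    """Temperature-scaled cosine similarities between embeddings of two augmented views."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    return z1 @ z2.t() / temperature  # (N, N) similarity matrix


def contrastive_loss(scores):
    """NT-Xent-style loss: matching views (diagonal entries) are the positives."""
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)


def dogo_step(model_a, model_b, x1, x2, tau=0.5, tau_kd=1.0, alpha=1.0):
    """One step of mutual learning: each model optimizes its own self-supervised
    loss plus a KL term aligning its softmax over similarity scores with the peer's."""
    za1, za2 = model_a(x1), model_a(x2)
    zb1, zb2 = model_b(x1), model_b(x2)

    scores_a = similarity_scores(za1, za2, tau)
    scores_b = similarity_scores(zb1, zb2, tau)

    # Self-supervised contrastive loss for each model.
    loss_a = contrastive_loss(scores_a)
    loss_b = contrastive_loss(scores_b)

    # Mutual distillation: KL divergence between softmax distributions over
    # similarity scores; the peer's distribution is detached so gradients do
    # not flow through the other model.
    log_p_a = F.log_softmax(scores_a / tau_kd, dim=1)
    log_p_b = F.log_softmax(scores_b / tau_kd, dim=1)
    kd_a = F.kl_div(log_p_a, F.softmax(scores_b.detach() / tau_kd, dim=1),
                    reduction="batchmean")
    kd_b = F.kl_div(log_p_b, F.softmax(scores_a.detach() / tau_kd, dim=1),
                    reduction="batchmean")

    return loss_a + alpha * kd_a, loss_b + alpha * kd_b
```

In this sketch, both models receive the same pair of augmented views and each is updated with its own combined loss, so the smaller model benefits from the peer's richer similarity structure while the larger model is trained jointly rather than pre-trained and frozen.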