It is widely acknowledged that small models perform poorly under the paradigm of self-supervised contrastive learning. Existing methods typically adopt a large off-the-shelf model and transfer its knowledge to the small one via knowledge distillation. Despite their effectiveness, distillation-based methods may be unsuitable for resource-constrained scenarios because of the considerable computational expense of deploying a large model. In this paper, we study the problem of training self-supervised small models without distillation signals. We first evaluate the representation spaces of small models and make two non-negligible observations: (i) small models can complete the pretext task without overfitting despite their limited capacity; (ii) small models universally suffer from over-clustering. We then verify multiple hypotheses about techniques expected to alleviate the over-clustering phenomenon. Finally, we combine the validated techniques and improve the baselines of five small architectures by considerable margins, which indicates that training small self-supervised contrastive models is feasible even without distillation signals.