It is widely accepted that small models perform poorly under self-supervised contrastive learning. Existing methods usually transfer knowledge from a large off-the-shelf model to the small one via distillation. Despite their effectiveness, distillation-based methods may be unsuitable for some resource-restricted scenarios because of the heavy computational cost of deploying a large model. In this paper, we study the problem of training self-supervised small models without distillation signals. We first evaluate the representation spaces of small models and make two notable observations: (i) small models can complete the pretext task without overfitting despite their limited capacity, and (ii) they universally suffer from over-clustering. We then test several hypotheses about techniques expected to alleviate the over-clustering phenomenon. Finally, we combine the validated techniques and improve the baseline performance of five small architectures by considerable margins, which indicates that training small self-supervised contrastive models is feasible even without distillation signals. The code is available at https://github.com/WOWNICE/ssl-small.