Contrastive methods have led a recent surge in the performance of self-supervised representation learning (SSL). Recent methods such as BYOL and SimSiam purportedly distill these contrastive methods down to their essence, removing the bells and whistles, including negative examples, that do not contribute to downstream performance. These "non-contrastive" methods work surprisingly well without using negatives, even though the global minimum lies at trivial collapse. We empirically analyze these non-contrastive methods and find that SimSiam is extraordinarily sensitive to dataset and model size. In particular, SimSiam representations undergo partial dimensional collapse if the model is too small relative to the dataset size. We propose a metric to measure the degree of this collapse and show that it can be used to forecast downstream task performance without any fine-tuning or labels. We further analyze architectural design choices and their effect on downstream performance. Finally, we demonstrate that shifting to a continual learning setting acts as a regularizer and prevents collapse, and that a hybrid of continual and multi-epoch training can improve linear probe accuracy by as many as 18 percentage points using ResNet-18 on ImageNet.
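The abstract does not specify the exact form of the collapse metric. As a hedged illustration only, and not the paper's definition, a common proxy for partial dimensional collapse is the effective rank of the embedding covariance matrix, computed as the exponential of the entropy of its normalized eigenvalue spectrum; the function and variable names below are hypothetical.

```python
# Hypothetical sketch (not the paper's metric): measure partial dimensional
# collapse via the effective rank of the embedding covariance.
import numpy as np


def effective_rank(features: np.ndarray) -> float:
    """features: (N, D) matrix of representations for N images."""
    # Center the features and form the D x D covariance matrix.
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(features) - 1)
    # Eigenvalues of a symmetric PSD matrix; clip tiny negatives from
    # numerical round-off.
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eigvals / eigvals.sum()
    p = p[p > 0]
    # exp(entropy) lies in [1, D]; values far below D indicate that the
    # representation occupies only a low-dimensional subspace, i.e. collapse.
    return float(np.exp(-(p * np.log(p)).sum()))
```

Under this proxy, a healthy encoder yields an effective rank close to the feature dimension D, while a partially collapsed one concentrates variance in a few directions and scores far lower; no labels or fine-tuning are needed, matching the label-free forecasting described above.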