Self-supervised learning has shown great potential for extracting powerful visual representations without human annotations. Various works have been proposed to approach self-supervised learning from different perspectives: (1) contrastive learning methods (e.g., MoCo, SimCLR) utilize both positive and negative samples to guide the training direction; (2) asymmetric network methods (e.g., BYOL, SimSiam) remove the need for negative samples by introducing a predictor network and a stop-gradient operation; (3) feature decorrelation methods (e.g., Barlow Twins, VICReg) instead aim to reduce the redundancy between feature dimensions. These methods appear quite different, as their loss functions are designed from different motivations. Their reported accuracies also vary, since different works employ different networks and training tricks. In this work, we demonstrate that these methods can be unified into the same form. Instead of comparing their loss functions, we derive a unified formula through gradient analysis. Furthermore, we conduct fair and detailed experiments to compare their performance. It turns out that there is little gap between these methods, and the use of a momentum encoder is the key factor for boosting performance. From this unified framework, we propose UniGrad, a simple but effective gradient form for self-supervised learning. It does not require a memory bank or a predictor network, yet still achieves state-of-the-art performance and can easily incorporate other training strategies. Extensive experiments on linear evaluation and many downstream tasks also show its effectiveness. Code shall be released.
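To make the idea of defining the update directly through its gradient concrete, the following is a minimal PyTorch sketch, not the exact UniGrad formula from the paper. The function name `unified_ssl_grad`, the weight `lam`, and the surrogate-loss trick are illustrative assumptions: the gradient on the online features is written by hand as an attraction toward the positive (momentum-branch) feature plus a repulsion along directions correlated across the batch, and is then injected into autograd via a surrogate loss.

```python
import torch
import torch.nn.functional as F

def unified_ssl_grad(u, v, lam=1.0):
    """Hypothetical hand-written gradient w.r.t. the online features u.

    u: (N, D) L2-normalized online-branch features.
    v: (N, D) L2-normalized momentum-branch (target) features.
    Attraction toward the positive target plus repulsion along directions
    correlated across the batch; an illustrative stand-in, not the paper's
    exact gradient form.
    """
    corr = (v.T @ v) / v.shape[0]    # (D, D) correlation of target features
    return -v + lam * (u @ corr)     # pull to positive, push from correlated dirs

# Toy usage: apply the manual gradient through a surrogate loss.
u = F.normalize(torch.randn(256, 128), dim=1).requires_grad_()
with torch.no_grad():                # target branch receives no gradient
    v = F.normalize(torch.randn(256, 128), dim=1)

grad = unified_ssl_grad(u, v)
surrogate = (u * grad.detach()).sum() / u.shape[0]
surrogate.backward()                 # u.grad now equals grad / N
```

Writing the method this way means no memory bank or predictor network is needed; only the current batch of momentum-encoder features enters the gradient.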