Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark, limiting their applicability in performance-critical settings. Building on prior theoretical insights (Mitrovic et al., 2021), we propose ReLICv2, which combines an explicit invariance loss with a contrastive objective over a varied set of appropriately constructed data views. ReLICv2 achieves 77.1% top-1 classification accuracy on ImageNet under linear evaluation with a ResNet50 architecture and 80.6% with larger ResNet models, outperforming previous state-of-the-art self-supervised approaches by a wide margin. Most notably, ReLICv2 is the first representation learning method to consistently outperform the supervised baseline in a like-for-like comparison across a range of standard ResNet architectures. Finally, we show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.
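To make the stated objective concrete, the following is a minimal sketch of how an explicit invariance penalty can be combined with a contrastive (InfoNCE-style) loss over two augmented views. This is an illustrative simplification under assumed conventions (cosine similarities, a softmax temperature, a KL-based invariance term weighted by a hypothetical coefficient `alpha`), not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relic_style_loss(z1, z2, temperature=0.1, alpha=1.0):
    """Sketch of a ReLICv2-style objective: InfoNCE contrastive term
    plus an explicit invariance penalty between the two views.
    `alpha` is a hypothetical trade-off weight, not from the paper."""
    # L2-normalise embeddings of the two augmented views
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    n = z1.shape[0]
    # temperature-scaled cosine similarities between views
    p12 = softmax(z1 @ z2.T / temperature)
    p21 = softmax(z2 @ z1.T / temperature)
    # contrastive term: the matching index in the batch is the positive pair
    contrastive = -np.mean(np.log(p12[np.arange(n), np.arange(n)] + 1e-12))
    # invariance term: KL divergence between the similarity
    # distributions induced by the two views of each point
    kl = np.sum(p12 * (np.log(p12 + 1e-12) - np.log(p21 + 1e-12)), axis=1)
    return contrastive + alpha * np.mean(kl)
```

Both terms are non-negative, and the invariance term vanishes when the two views produce identical similarity distributions, which is the intuition behind enforcing invariance explicitly rather than relying on the contrastive term alone.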