Studies on self-supervised visual representation learning (SSL) improve encoder backbones to discriminate training samples without labels. While CNN encoders trained via SSL achieve recognition performance comparable to that of supervised learning, their network attention remains under-explored as a means of further improvement. Motivated by transformers, which exploit visual attention effectively in recognition scenarios, we propose a CNN Attention REvitalization (CARE) framework to train attentive CNN encoders guided by transformers in SSL. The proposed CARE framework consists of a CNN stream (C-stream) and a transformer stream (T-stream), where each stream contains two branches. The C-stream follows an existing SSL framework with two CNN encoders, two projectors, and a predictor. The T-stream contains two transformers, two projectors, and a predictor; it connects to the CNN encoders and runs in parallel to the rest of the C-stream. During training, we perform SSL in both streams simultaneously and use the T-stream output to supervise the C-stream. The features from the CNN encoders are modulated in the T-stream for visual attention enhancement and become suitable for the SSL scenario. We use these modulated features to supervise the C-stream so that it learns attentive CNN encoders. In this way, we revitalize CNN attention by using transformers as guidance. Experiments on several standard visual recognition benchmarks, including image classification, object detection, and semantic segmentation, show that the proposed CARE framework improves CNN encoder backbones to state-of-the-art performance.
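To make the two-stream design concrete, below is a minimal PyTorch sketch of the CARE training objective. All module names and sizes (e.g., `CARE`, `care_loss`, the ResNet-50 backbone, the two-layer transformer, and the loss weight `lam`) are illustrative assumptions rather than the authors' released implementation; the momentum (target) networks stand in for the second branch of each stream, and detaching the T-stream in the supervision term is one plausible design choice.

```python
# Minimal sketch of the CARE two-stream objective (an illustration under
# stated assumptions, not the authors' implementation).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


def mlp(in_dim, hidden_dim, out_dim):
    # BYOL-style projector/predictor head (hypothetical sizes).
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )


class CARE(nn.Module):
    def __init__(self, feat_dim=2048, proj_dim=256):
        super().__init__()
        # C-stream: online and momentum (target) CNN encoders plus heads.
        net = torchvision.models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(net.children())[:-2])  # keep the feature map
        self.encoder_t = copy.deepcopy(self.encoder)
        self.c_proj = mlp(feat_dim, 4096, proj_dim)
        self.c_proj_t = copy.deepcopy(self.c_proj)
        self.c_pred = mlp(proj_dim, 4096, proj_dim)
        # T-stream: transformers that modulate the CNN feature maps, plus heads.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.trans = nn.TransformerEncoder(layer, num_layers=2)
        self.trans_t = copy.deepcopy(self.trans)
        self.t_proj = mlp(feat_dim, 4096, proj_dim)
        self.t_proj_t = copy.deepcopy(self.t_proj)
        self.t_pred = mlp(proj_dim, 4096, proj_dim)

    def _features(self, encoder, trans, x):
        fmap = encoder(x)                         # [B, C, H, W]
        c_feat = fmap.mean(dim=(2, 3))            # pooled feature for the C-stream
        tokens = fmap.flatten(2).transpose(1, 2)  # [B, HW, C] tokens for the T-stream
        t_feat = trans(tokens).mean(dim=1)        # attention-modulated feature
        return c_feat, t_feat

    def forward_online(self, x):
        c_feat, t_feat = self._features(self.encoder, self.trans, x)
        return self.c_pred(self.c_proj(c_feat)), self.t_pred(self.t_proj(t_feat))

    @torch.no_grad()
    def forward_target(self, x):
        c_feat, t_feat = self._features(self.encoder_t, self.trans_t, x)
        return self.c_proj_t(c_feat), self.t_proj_t(t_feat)


def neg_cos(p, z):
    # Negative cosine similarity; the target z is always detached.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()


def care_loss(model, x1, x2, lam=1.0):
    # x1, x2: two augmented views of the same image batch.
    c1, t1 = model.forward_online(x1)
    c2, t2 = model.forward_online(x2)
    zc1, zt1 = model.forward_target(x1)
    zc2, zt2 = model.forward_target(x2)
    loss_c = neg_cos(c1, zc2) + neg_cos(c2, zc1)  # SSL in the C-stream
    loss_t = neg_cos(t1, zt2) + neg_cos(t2, zt1)  # SSL in the T-stream
    loss_s = neg_cos(c1, t1) + neg_cos(c2, t2)    # T-stream output supervises C-stream
    return loss_c + loss_t + lam * loss_s
```

A training step would feed two augmented views of each batch to `care_loss`, back-propagate through the online branches only, and update the target branches with an exponential moving average of the online weights, as in BYOL-style SSL; after pre-training, only the CNN encoder is kept for downstream recognition tasks.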