Recently, self-supervised learning methods have achieved remarkable success in visual pre-training tasks. By simply pulling the different augmented views of each image together, or through other novel mechanisms, they can learn rich unsupervised knowledge and significantly improve the transfer performance of pre-trained models. However, these works still cannot avoid the representation collapse problem, i.e., they either focus only on limited regions, or the features extracted from totally different regions inside each image are nearly identical. This problem generally prevents the pre-trained models from sufficiently describing the multi-grained information inside images, which further limits the upper bound of their transfer performance. To alleviate this issue, this paper introduces a simple but effective mechanism called Exploring the Diversity and Invariance in Yourself (E-DIY). By pushing the most different regions inside each augmented view away from each other, E-DIY preserves the diversity of the extracted region-level features. By pulling the most similar regions from different augmented views of the same image together, E-DIY ensures the robustness of the region-level features. Benefiting from this diversity- and invariance-exploring mechanism, E-DIY maximally extracts the multi-grained visual information inside each image. Extensive experiments on downstream tasks demonstrate the superiority of our proposed approach, e.g., a 2.1% improvement over the strong baseline BYOL on COCO when fine-tuning Mask R-CNN with the R50-C4 backbone and the 1X learning schedule.
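The two objectives above can be illustrated with a minimal sketch. This is not the paper's actual implementation: the function names, the use of cosine similarity on per-region feature vectors, and the min/max pair-selection strategy are all assumptions made for illustration only.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between two sets of region features."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def diversity_loss(regions):
    """Push the most different regions inside one view further apart.

    For each region, find its least similar peer in the same view;
    minimizing their similarity increases region-level diversity.
    (Illustrative sketch, not the paper's exact loss.)
    """
    sim = cosine_sim(regions, regions)
    np.fill_diagonal(sim, np.inf)        # ignore self-similarity
    most_different = sim.min(axis=1)     # similarity to the most dissimilar region
    return most_different.mean()         # lower value = more diverse regions

def invariance_loss(view1_regions, view2_regions):
    """Pull the most similar cross-view regions together.

    For each region in view 1, find its best match in view 2;
    minimizing the negative similarity enforces invariance
    across augmentations. (Illustrative sketch.)
    """
    sim = cosine_sim(view1_regions, view2_regions)
    best_match = sim.max(axis=1)         # similarity to best-matching region
    return -best_match.mean()            # minimized when matched regions align
```

In a training loop, both terms would be combined (e.g., as a weighted sum) with the global instance-level objective of the baseline such as BYOL; the weighting is a design choice not specified here.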