In this paper, we aim to understand self-supervised pretraining by studying whether self-supervised representation pretraining methods learn part-aware representations. The study is mainly motivated by the observation that the random views used in contrastive learning and the random masked (visible) patches used in masked image modeling often cover only object parts. We explain that contrastive learning is a part-to-whole task: the projection layer hallucinates the representation of the whole object from the representation of an object part produced by the encoder; and that masked image modeling is a part-to-part task: the masked patches of an object are hallucinated from the visible patches. This explanation suggests that a self-supervised pretrained encoder needs to understand object parts. We empirically compare off-the-shelf encoders pretrained with several representative methods on object-level recognition and part-level recognition. The results show that the fully-supervised model outperforms the self-supervised models on object-level recognition, whereas most self-supervised contrastive learning and masked image modeling methods outperform the fully-supervised method on part-level recognition. We further observe that combining contrastive learning and masked image modeling improves the performance.
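To make the part-to-whole and part-to-part readings concrete, a minimal sketch of the two objectives is given below, using a standard SimCLR-style InfoNCE loss and an MAE-style reconstruction loss; the encoder $f$, projection head $g$, decoder $d$, temperature $\tau$, and masked/visible patch index sets $\mathcal{M}$/$\mathcal{V}$ are illustrative notation, not necessarily the exact formulation evaluated in this paper. For two random part views $x_i, x_j$ of the same image, with $z_k = g(f(x_k))$,
\[
\mathcal{L}_{\mathrm{CL}} = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k \neq i} \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)},
\]
so the projection head $g$ must map a part-level representation to an embedding shared across views of the whole object (part-to-whole). For masked image modeling, with visible patches $x_{\mathcal{V}}$ and ground-truth masked patches $x_p$,
\[
\mathcal{L}_{\mathrm{MIM}} = \frac{1}{|\mathcal{M}|} \sum_{p \in \mathcal{M}} \big\| d\big(f(x_{\mathcal{V}})\big)_p - x_p \big\|_2^2,
\]
so the decoder $d$ hallucinates one set of parts (masked patches) from another (visible patches), i.e., a part-to-part task.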