Recently, significant progress has been made in masked image modeling to catch up to masked language modeling. However, unlike words in NLP, the lack of semantic decomposition of images still makes masked autoencoding (MAE) different between vision and language. In this paper, we explore a potential visual analogue of words, i.e., semantic parts, and we integrate semantic information into the training process of MAE by proposing a Semantic-Guided Masking strategy. Compared to the widely adopted random masking, our masking strategy gradually guides the network to learn various information, i.e., from intra-part patterns to inter-part relations. In particular, we achieve this in two steps. 1) Semantic part learning: we design a self-supervised part learning method that obtains semantic parts by leveraging and refining the multi-head attention of a ViT-based encoder. 2) Semantic-guided MAE (SemMAE) training: we design a masking strategy that varies from masking a portion of patches in each part to masking a portion of (whole) parts in an image. Extensive experiments on various vision tasks show that SemMAE learns better image representations by integrating semantic information. In particular, SemMAE achieves 84.5% fine-tuning accuracy on ImageNet-1k, outperforming vanilla MAE by 1.4%. On semantic segmentation and fine-grained recognition tasks, SemMAE also brings significant improvements and yields state-of-the-art performance.
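The two-stage masking schedule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper name `semantic_guided_mask`, the per-patch part assignment, and the single scalar `alpha` (0 for intra-part patch masking, approaching 1 for whole-part masking) are all assumptions made for this sketch.

```python
import random


def semantic_guided_mask(part_of_patch, mask_ratio, alpha):
    """Hypothetical sketch of a semantic-guided masking schedule.

    part_of_patch: list assigning each patch index to a part id.
    mask_ratio:    fraction of all patches to mask.
    alpha:         schedule knob in [0, 1]; 0 masks a portion of patches
                   inside each part (intra-part stage), larger values
                   increasingly mask whole parts (inter-part stage).
    Returns a set of masked patch indices of size round(mask_ratio * N).
    """
    # Group patch indices by their semantic part.
    parts = {}
    for idx, part_id in enumerate(part_of_patch):
        parts.setdefault(part_id, []).append(idx)

    n_total = len(part_of_patch)
    n_mask = int(round(mask_ratio * n_total))
    masked = set()

    # Stage A: with probability alpha, mask entire parts (inter-part).
    part_ids = list(parts)
    random.shuffle(part_ids)
    for part_id in part_ids:
        if len(masked) >= n_mask:
            break
        if random.random() < alpha:
            masked.update(parts[part_id])

    # Stage B: fill the remaining budget by masking a portion of the
    # patches within each part (intra-part).
    for part_id in part_ids:
        if len(masked) >= n_mask:
            break
        remaining = [i for i in parts[part_id] if i not in masked]
        random.shuffle(remaining)
        take = min(len(remaining), n_mask - len(masked))
        masked.update(remaining[:take])

    # Trim in case whole-part masking overshot the budget.
    return set(list(masked)[:n_mask])
```

With `alpha = 0` this reduces to masking a fixed portion of patches inside every part; as `alpha` grows toward 1, the mask is increasingly composed of whole parts, matching the "intra-part patterns to inter-part relations" progression described in the abstract.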