Transformers and masked language modeling are quickly being adopted and explored in computer vision as vision transformers and masked image modeling (MIM). In this work, we argue that image token masking differs from token masking in text, due to the amount and correlation of tokens in an image. In particular, to generate a challenging pretext task for MIM, we advocate a shift from random masking to informed masking. We develop and exhibit this idea in the context of distillation-based MIM, where a teacher transformer encoder generates an attention map, which we use to guide masking for the student. We thus introduce a novel masking strategy, called attention-guided masking (AttMask), and we demonstrate its effectiveness over random masking for dense distillation-based MIM as well as plain distillation-based self-supervised learning on classification tokens. We confirm that AttMask accelerates the learning process and improves the performance on a variety of downstream tasks. We provide the implementation code at https://github.com/gkakogeorgiou/attmask.
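To make the core idea concrete, below is a minimal PyTorch sketch of attention-guided masking: average the teacher's [CLS]-to-patch attention over heads, then hide the most-attended patches from the student. The function and parameter names (`attmask`, `cls_attn`, `mask_ratio`) are illustrative assumptions, not the repository's actual API.

```python
import torch

def attmask(cls_attn: torch.Tensor, mask_ratio: float = 0.3) -> torch.Tensor:
    """Sketch of attention-guided masking (assumed interface, not the repo API).

    cls_attn:   (B, H, N) attention weights from the [CLS] token to the
                N patch tokens, taken from the teacher's last attention layer.
    mask_ratio: fraction of patch tokens to mask.
    Returns a boolean mask of shape (B, N); True marks a masked patch.
    """
    attn = cls_attn.mean(dim=1)                  # average over heads -> (B, N)
    num_mask = int(mask_ratio * attn.shape[1])   # number of patches to hide
    idx = attn.argsort(dim=1, descending=True)   # most-attended patches first
    mask = torch.zeros_like(attn, dtype=torch.bool)
    mask.scatter_(1, idx[:, :num_mask], True)    # hide the top-attended patches
    return mask
```

Masking the most-attended patches (rather than random ones) removes the regions the teacher deems most discriminative, which is what makes the pretext task challenging for the student.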