Recent advances in self-supervised learning integrate masked modeling and Siamese networks into a single framework to reap the advantages of both techniques. However, the erasing-based masking scheme used in masked image modeling was not originally designed for Siamese networks. Existing approaches simply inherit the default loss design from earlier Siamese networks and ignore the information loss and the change in distance that the masking operation introduces. In this paper, we propose a filling-based masking strategy called MixMask, which prevents the information loss caused by randomly erased image regions in the vanilla masking method. We further introduce a dynamic loss function with a soft distance to adapt to the integrated architecture and to avoid mismatches between the transformed input and the objective in Masked Siamese ConvNets (MSCN). The dynamic loss distance is calculated according to the proposed mix-masking scheme. Extensive experiments are conducted on CIFAR-100, Tiny-ImageNet, and ImageNet-1K. The results demonstrate that the proposed framework achieves better accuracy on linear probing, semi-supervised finetuning, and supervised finetuning, outperforming the state-of-the-art MSCN by a significant margin. We also show its superiority on the downstream tasks of object detection and segmentation. Our source code is available at https://github.com/LightnessOfBeing/MixMask.
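The abstract does not spell out the masking or loss formulas, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes a block-wise binary mask, assumes that masked regions are filled with pixels from another image in the batch (here, the batch in reverse order), and assumes that the soft distance weights the two targets by the kept-area fraction `lam`. All function names (`block_mask`, `mix_mask`, `soft_distance_loss`) and parameters (`grid`, `mask_ratio`) are hypothetical.

```python
import numpy as np


def block_mask(height, width, grid, mask_ratio, rng):
    """Binary mask (1 = keep original pixel), constant within each grid cell.

    Assumes height and width are divisible by grid.
    """
    cells = (rng.random((grid, grid)) >= mask_ratio).astype(np.float32)
    # Upsample the per-cell decisions to pixel resolution.
    return np.kron(cells, np.ones((height // grid, width // grid), dtype=np.float32))


def mix_mask(images, grid=4, mask_ratio=0.5, rng=None):
    """Fill masked regions with a partner image instead of erasing them.

    images: array of shape (N, H, W, C). Returns the mixed batch and lam,
    the fraction of pixels that still come from the original image.
    """
    if rng is None:
        rng = np.random.default_rng()
    _, h, w, _ = images.shape
    mask = block_mask(h, w, grid, mask_ratio, rng)[None, :, :, None]
    partner = images[::-1]  # assumption: pair each image with the reversed batch
    mixed = mask * images + (1.0 - mask) * partner
    return mixed, float(mask.mean())


def cosine_distance(a, b):
    """Row-wise cosine distance between two (N, D) embedding arrays."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - (a * b).sum(axis=-1)


def soft_distance_loss(z_mixed, z_clean, lam):
    """Weight the distance to each source embedding by its share of the mix."""
    z_partner = z_clean[::-1]  # same pairing as in mix_mask
    loss = lam * cosine_distance(z_mixed, z_clean)
    loss = loss + (1.0 - lam) * cosine_distance(z_mixed, z_partner)
    return float(loss.mean())
```

In a Siamese setup, `z_mixed` would be the embedding of the mixed batch from one branch and `z_clean` the embedding of the unmixed batch from the other; because `lam` is recomputed for every sampled mask, the target distance changes with the mask rather than staying fixed.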