Deep supervision, which adds extra supervision to the intermediate features of a neural network, was widely used in image classification in the early deep learning era, since it significantly reduces training difficulty and eases optimization, e.g., by mitigating vanishing gradients compared with vanilla training. Nevertheless, with the emergence of normalization techniques and residual connections, deep supervision in image classification was gradually phased out. In this paper, we revisit deep supervision for masked image modeling (MIM), which pre-trains a Vision Transformer (ViT) via a mask-and-predict scheme. Experimentally, we find that deep supervision drives the shallower layers to learn more meaningful representations, accelerates model convergence, and expands attention diversity. Our approach, called DeepMIM, significantly boosts the representation capability of each layer. In addition, DeepMIM is compatible with many MIM models across a range of reconstruction targets. For instance, using ViT-B, DeepMIM on MAE achieves 84.2 top-1 accuracy on ImageNet, outperforming MAE by +0.6. By combining DeepMIM with a stronger tokenizer, CLIP, our model achieves state-of-the-art performance on various downstream tasks, including image classification (85.6 top-1 accuracy on ImageNet-1K, outperforming MAE-CLIP by +0.8), object detection (52.8 APbox on COCO), and semantic segmentation (53.1 mIoU on ADE20K). Code and models are available at https://github.com/OliverRensu/DeepMIM.
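To make the core idea concrete, here is a minimal sketch of a deep-supervision objective for mask-and-predict pre-training. It is not the paper's implementation: the function names, the single `aux_weight` hyperparameter, and the use of NumPy arrays in place of real ViT features are all illustrative assumptions; the only point is that auxiliary reconstruction losses on intermediate-layer predictions are summed with the final layer's loss.

```python
import numpy as np

def reconstruction_loss(pred, target):
    """Mean-squared error on masked patches (the generic MIM objective)."""
    return float(np.mean((pred - target) ** 2))

def deep_supervision_loss(intermediate_preds, final_pred, target, aux_weight=0.5):
    """Sketch of deep supervision: the final layer's reconstruction loss
    plus weighted auxiliary losses on predictions decoded from shallower
    layers. `aux_weight` is a hypothetical balancing hyperparameter."""
    loss = reconstruction_loss(final_pred, target)
    for pred in intermediate_preds:
        loss += aux_weight * reconstruction_loss(pred, target)
    return loss

# Toy example: 4 masked patches of dimension 8 (stand-ins for ViT features).
rng = np.random.default_rng(0)
target = rng.standard_normal((4, 8))
final_pred = target + 0.1 * rng.standard_normal((4, 8))               # deep layer: close
mids = [target + 0.5 * rng.standard_normal((4, 8)) for _ in range(2)]  # shallow layers: noisier

total = deep_supervision_loss(mids, final_pred, target)
print(total)
```

Setting `aux_weight=0` recovers the vanilla single-loss training; a positive weight gives every supervised intermediate layer its own gradient signal, which is the mechanism the abstract credits for more meaningful shallow-layer representations and faster convergence.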