Vision Transformers (ViT) have become widely adopted architectures for various vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT, leading to state-of-the-art performance on image classification, detection, and semantic segmentation. In this paper, our ConvMAE framework demonstrates that multi-scale hybrid convolution-transformer architectures can learn more discriminative representations via the masked auto-encoding scheme. However, directly applying the original masking strategy leads to heavy computational cost and a pretraining-finetuning discrepancy. To tackle this issue, we adopt masked convolution to prevent information leakage in the convolution blocks, and we propose a simple block-wise masking strategy to ensure computational efficiency. We also propose to directly supervise the multi-scale features of the encoder to strengthen its multi-scale representations. Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base. On object detection, ConvMAE-Base finetuned for only 25 epochs surpasses MAE-Base finetuned for 100 epochs by 2.9% box AP and 2.2% mask AP, respectively. Code and pretrained models are available at https://github.com/Alpha-VL/ConvMAE.
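To make the two masking ideas named above concrete, the sketch below is a minimal, hedged illustration (not the authors' implementation): a block-wise mask drawn on the coarsest token grid and shared across stages, and a masked convolution that zeroes masked positions before and after the convolution so no information leaks from masked regions into visible tokens. All module names, shapes, and the 14x14 grid are assumptions for illustration only.

```python
# Illustrative sketch of block-wise masking and masked convolution (PyTorch).
# Not the released ConvMAE code; shapes and names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


def block_wise_mask(batch, grid, mask_ratio=0.75, device="cpu"):
    """Keep a (1 - mask_ratio) fraction of tokens on the coarsest grid;
    the same mask is later upsampled to finer-resolution stages."""
    num_tokens = grid * grid
    num_keep = int(num_tokens * (1 - mask_ratio))
    noise = torch.rand(batch, num_tokens, device=device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]
    mask = torch.zeros(batch, num_tokens, device=device)
    mask.scatter_(1, keep_idx, 1.0)        # 1 = visible, 0 = masked
    return mask.view(batch, 1, grid, grid)


class MaskedConvBlock(nn.Module):
    """Depthwise conv whose input and output are multiplied by the mask,
    so masked positions neither contribute to nor receive any signal."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, mask):
        # mask: (B, 1, h, w) resized to the current feature resolution
        m = F.interpolate(mask, size=x.shape[-2:], mode="nearest")
        return self.conv(x * m) * m        # zero masked positions before and after conv


if __name__ == "__main__":
    x = torch.randn(2, 64, 56, 56)            # hypothetical stage-1 features
    mask = block_wise_mask(batch=2, grid=14)  # mask defined on a 14x14 coarse grid
    out = MaskedConvBlock(64)(x, mask)
    print(out.shape)                          # torch.Size([2, 64, 56, 56])
```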