In this paper, we present an integral pre-training framework based on masked image modeling (MIM). We advocate for pre-training the backbone and neck jointly so that the transfer gap between MIM and downstream recognition tasks is minimal. We make two technical contributions. First, we unify the reconstruction and recognition necks by inserting a feature pyramid into the pre-training stage. Second, we complement masked image modeling (MIM) with masked feature modeling (MFM), which offers multi-stage supervision to the feature pyramid. The pre-trained models, termed integrally pre-trained transformer pyramid networks (iTPNs), serve as powerful foundation models for visual recognition. In particular, the base/large-level iTPN achieves an 86.2%/87.8% top-1 accuracy on ImageNet-1K, a 53.2%/55.6% box AP on COCO object detection with the 1x training schedule using Mask R-CNN, and a 54.7%/57.7% mIoU on ADE20K semantic segmentation using UPerHead -- all these results set new records. Our work inspires the community to work on unifying upstream pre-training and downstream fine-tuning tasks. Code and the pre-trained models will be released at https://github.com/sunsmarterjie/iTPN.
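To make the two objectives concrete, below is a minimal numpy sketch of how a pixel-reconstruction (MIM) loss and a multi-stage feature-matching (MFM) loss could be combined, each computed on masked positions only. All shapes, the teacher features, the per-stage averaging, and the loss weighting are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 196 patch tokens, 768-dim pixel targets,
# and 3 pyramid stages with assumed channel widths.
num_patches, patch_dim = 196, 768
mask = rng.random(num_patches) < 0.5  # True = patch was masked out

pred_pixels = rng.standard_normal((num_patches, patch_dim))
target_pixels = rng.standard_normal((num_patches, patch_dim))

def mim_loss(pred, target, mask):
    """Pixel reconstruction loss (MIM), averaged over masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return per_patch[mask].mean()

def mfm_loss(stage_feats, teacher_feats, mask):
    """Masked feature modeling (MFM): match each pyramid stage's features
    to target features (e.g. from a teacher), on masked positions only,
    then average over stages to get multi-stage supervision."""
    stage_losses = []
    for feat, target in zip(stage_feats, teacher_feats):
        per_patch = ((feat - target) ** 2).mean(axis=-1)
        stage_losses.append(per_patch[mask].mean())
    return sum(stage_losses) / len(stage_losses)

# Assumed per-stage feature dimensions; real values depend on the backbone.
stage_dims = [192, 384, 768]
stage_feats = [rng.standard_normal((num_patches, d)) for d in stage_dims]
teacher_feats = [rng.standard_normal((num_patches, d)) for d in stage_dims]

# The weighting between the two terms is a made-up hyperparameter here.
mfm_weight = 1.0
total_loss = (mim_loss(pred_pixels, target_pixels, mask)
              + mfm_weight * mfm_loss(stage_feats, teacher_feats, mask))
```

The key design point this sketch illustrates is that MFM supervises intermediate pyramid features, not just the final pixel output, so every stage of the neck receives a gradient signal during pre-training.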