Self-attention-based transformer models have dominated many computer vision tasks in the past few years. Their strong performance depends heavily on very large labeled image datasets. To reduce this reliance, reconstruction-based masked autoencoders, which learn high-quality transferable representations from unlabeled images, are gaining popularity. For the same purpose, recent weakly supervised image pretraining methods explore language supervision from the text captions accompanying the images. In this work, we propose masked image pretraining on language-assisted representation, dubbed MILAN. Instead of predicting raw pixels or low-level features, our pretraining objective is to reconstruct image features that carry substantial semantic signals obtained through caption supervision. Moreover, to accommodate our reconstruction target, we propose a more efficient prompting decoder architecture and a semantic-aware mask sampling mechanism, which further improve the transfer performance of the pretrained model. Experimental results demonstrate that MILAN delivers higher accuracy than previous works. When the masked autoencoder is pretrained and finetuned on the ImageNet-1K dataset with an input resolution of 224x224, MILAN achieves a top-1 accuracy of 85.4% on ViT-B/16, surpassing the previous state of the art by 1%. On the downstream semantic segmentation task, MILAN achieves 52.7 mIoU with a ViT-B/16 backbone on the ADE20K dataset, outperforming previous masked pretraining results by 4 points.
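The two ideas named above, semantic-aware mask sampling and reconstruction of semantic image features instead of pixels, can be illustrated with a minimal NumPy sketch. The abstract does not specify either mechanism in detail, so everything here is an assumption made for illustration: the per-patch `importance` scores (which might, for example, come from a caption-supervised encoder), the `keep_ratio` parameter, the function names, and the choice of a mean squared error on L2-normalized features. It is not the authors' implementation.

```python
import numpy as np


def sample_visible_patches(importance, keep_ratio, rng):
    """Choose which patches stay visible, favoring high-importance patches.

    importance: 1-D array of strictly positive per-patch scores (hypothetical;
                the source of these scores is an assumption of this sketch).
    keep_ratio: fraction of patches left unmasked (hypothetical parameter).
    """
    num_patches = importance.shape[0]
    num_keep = max(1, int(round(num_patches * keep_ratio)))
    probs = importance / importance.sum()
    # Sample without replacement, with probability proportional to importance.
    keep = rng.choice(num_patches, size=num_keep, replace=False, p=probs)
    return np.sort(keep)


def feature_reconstruction_loss(pred, target, eps=1e-8):
    """Mean squared error between L2-normalized feature vectors.

    pred, target: arrays of shape (num_patches, feature_dim). The target
    stands in for features from a caption-supervised image encoder.
    """
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_patches, dim = 196, 8  # 14 x 14 patches for a 224x224 image with 16x16 patches
    importance = rng.random(num_patches) + 1e-3  # placeholder scores
    target = rng.normal(size=(num_patches, dim))  # placeholder target features
    pred = rng.normal(size=(num_patches, dim))    # placeholder decoder output

    visible = sample_visible_patches(importance, keep_ratio=0.25, rng=rng)
    print("visible patches:", visible.shape[0])
    print("loss:", feature_reconstruction_loss(pred, target))
```

In this sketch the loss is computed over all patches; restricting it to the masked patches only would be an equally plausible design, and the abstract does not say which is used.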