Both masked image modeling (MIM) and natural language supervision have advanced transferable visual pre-training. In this work, we seek the synergy between the two paradigms and study the emergent properties when MIM meets natural language supervision. To this end, we present a novel masked visual Reconstruction In Language semantic Space (RILS) pre-training framework, in which sentence representations, encoded by the text encoder, serve as prototypes that transform vision-only signals into patch-sentence probabilities, which act as semantically meaningful MIM reconstruction targets. The vision model can therefore capture useful components with structured information by predicting the proper semantics of masked tokens. Better visual representations can, in turn, improve the text encoder via the image-text alignment objective, which is essential for an effective MIM target transformation. Extensive experimental results demonstrate that our method not only inherits the strengths of previous MIM and CLIP approaches but also achieves further improvements on various tasks thanks to their mutual benefits. RILS exhibits advanced transferability on downstream classification, detection, and segmentation, especially in low-shot regimes. Code will be made available at https://github.com/hustvl/RILS.
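The target transformation described above can be illustrated with a minimal sketch. It is not the released RILS implementation: the use of cosine similarity, the temperature of 0.07, the cross-entropy loss, the toy 2-D vectors, and all function names (`patch_sentence_probs`, `masked_reconstruction_loss`) are assumptions made purely for illustration. The only idea taken from the abstract is that sentence embeddings act as prototypes that turn patch features into patch-sentence probabilities, which the model then predicts for masked patches.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def patch_sentence_probs(patch_feats, sentence_feats, temperature=0.07):
    """Map each patch feature to a distribution over sentence prototypes.

    Assumption: similarity is cosine similarity scaled by a temperature.
    """
    sents = [l2_normalize(s) for s in sentence_feats]
    probs = []
    for p in patch_feats:
        p = l2_normalize(p)
        logits = [sum(a * b for a, b in zip(p, s)) / temperature for s in sents]
        probs.append(softmax(logits))
    return probs


def masked_reconstruction_loss(target_probs, pred_probs, mask, eps=1e-12):
    """Mean cross-entropy between target and predicted distributions,
    computed only on patches where mask is True (assumed loss form)."""
    losses = [
        -sum(t * math.log(q + eps) for t, q in zip(tp, pp))
        for tp, pp, m in zip(target_probs, pred_probs, mask)
        if m
    ]
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: two sentence prototypes and three patches, all hypothetical.
sentences = [[1.0, 0.0], [0.0, 1.0]]
clean_patches = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]   # features of the unmasked image
predicted_patches = [[0.8, 0.2], [0.1, 0.9], [0.5, 0.5]]  # predictions from the masked image
mask = [True, True, False]  # only the first two patches were masked

targets = patch_sentence_probs(clean_patches, sentences)
preds = patch_sentence_probs(predicted_patches, sentences)
loss = masked_reconstruction_loss(targets, preds, mask)
```

In this sketch the reconstruction target for each patch is a probability vector over sentences rather than raw pixels or vision-only features, which is the sense in which reconstruction happens in a language semantic space.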