We introduce Correlational Image Modeling (CIM), a novel and surprisingly effective approach to self-supervised visual pre-training. CIM performs a simple pretext task: we randomly crop image regions (exemplars) from an input image (context) and predict correlation maps between the exemplars and the context. Three key designs make correlational image modeling a nontrivial and meaningful self-supervisory task. First, to generate useful exemplar-context pairs, we crop image regions with various scales, shapes, rotations, and transformations. Second, we employ a bootstrap learning framework with an online encoder and a target encoder. During pre-training, the former takes the exemplars as input while the latter encodes the context. Third, we model the output correlation maps via a simple cross-attention block, in which the context serves as the queries and the exemplars supply the keys and values. We show that CIM performs on par with or better than the current state of the art on self-supervised and transfer benchmarks.
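To make the pretext task concrete, the following is a minimal sketch of its three ingredients: sampling an exemplar box, building a map that marks where the exemplar lies in the context, and a single-head cross-attention step in which context tokens act as queries and exemplar tokens as keys and values. It is an illustration under stated assumptions, not the authors' implementation. The function names (`crop_exemplar`, `target_map`, `cross_attention`), the 8 x 8 token grid, the binary form of the target map, and the use of random vectors in place of online and target encoder outputs are all hypothetical choices made for this example. Rotations and other transformations of the crop are omitted.

```python
import math
import random


def crop_exemplar(h, w, rng, min_frac=0.2, max_frac=0.6):
    """Sample a random axis-aligned box (top, left, height, width) in an h x w grid."""
    eh = max(1, int(h * rng.uniform(min_frac, max_frac)))
    ew = max(1, int(w * rng.uniform(min_frac, max_frac)))
    top = rng.randint(0, h - eh)
    left = rng.randint(0, w - ew)
    return top, left, eh, ew


def target_map(h, w, box):
    """Binary map over the context grid: 1 inside the cropped box (assumed target form)."""
    top, left, eh, ew = box
    return [[1.0 if top <= r < top + eh and left <= c < left + ew else 0.0
             for c in range(w)]
            for r in range(h)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(wt * v[j] for wt, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


rng = random.Random(0)
H, W, D = 8, 8, 4  # hypothetical token grid size and feature width

# Stand-in for target-encoder output: one random feature vector per context token.
context = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(H * W)]

box = crop_exemplar(H, W, rng)
top, left, eh, ew = box

# Stand-in for online-encoder output: here we simply reuse the context features
# inside the box; a real model would encode the cropped pixels separately.
exemplar = [context[r * W + c]
            for r in range(top, top + eh)
            for c in range(left, left + ew)]

# Context tokens are the queries; exemplar tokens supply keys and values.
attended = cross_attention(context, exemplar, exemplar)
target = target_map(H, W, box)
```

In a full model, a small prediction head would turn each row of `attended` into one score per context position, and those scores would be compared against `target` with a per-position loss. Both the head and the loss are left out here because the abstract does not specify them.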