We introduce Correlational Image Modeling (CIM), a novel and surprisingly effective approach to self-supervised visual pre-training. Our CIM performs a simple pretext task: we randomly crop image regions (exemplars) from an input image (context) and predict correlation maps between the exemplars and the context. Three key designs enable correlational image modeling as a nontrivial and meaningful self-supervisory task. First, to generate useful exemplar-context pairs, we consider cropping image regions with various scales, shapes, rotations, and transformations. Second, we employ a bootstrap learning framework that involves online and target encoders. During pre-training, the former takes exemplars as inputs while the latter encodes the context. Third, we model the output correlation maps via a simple cross-attention block, within which the context serves as queries and the exemplars offer values and keys. We show that CIM performs on par or better than the current state of the art on self-supervised and transfer benchmarks.
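The cross-attention design described above can be sketched in a few lines. This is a minimal, illustrative NumPy version, not the authors' implementation: projection matrices, multi-head structure, and the prediction head are omitted, and all function names here are hypothetical. Context tokens act as queries, exemplar tokens supply keys and values, and the attention weights play the role of a correlation map between the two.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def correlation_cross_attention(context_tokens, exemplar_tokens):
    """Single-head cross-attention sketch (hypothetical, simplified).

    context_tokens:  (n_ctx, d) - queries come from the context
    exemplar_tokens: (n_ex, d)  - keys and values come from the exemplar
    Returns the attended output and the (n_ctx, n_ex) attention map,
    which here stands in for the predicted correlation map.
    """
    d = context_tokens.shape[-1]
    Q = context_tokens                     # queries: context
    K = exemplar_tokens                    # keys: exemplar
    V = exemplar_tokens                    # values: exemplar
    scores = Q @ K.T / np.sqrt(d)          # (n_ctx, n_ex) affinities
    attn = softmax(scores, axis=-1)        # rows sum to 1
    return attn @ V, attn

rng = np.random.default_rng(0)
ctx = rng.standard_normal((196, 64))       # e.g. 14x14 context patches
ex = rng.standard_normal((16, 64))         # e.g. 4x4 exemplar patches
out, corr_map = correlation_cross_attention(ctx, ex)
```

In this sketch the correlation map has one row per context token and one column per exemplar token; in the actual method a decoder would turn such attention outputs into the predicted correlation map supervised by the known crop location.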