Pixel-level labels are particularly expensive to acquire. Hence, pretraining is a critical step for improving models on tasks like semantic segmentation. However, prominent algorithms for pretraining neural networks use image-level objectives, e.g. image classification, image-text alignment à la CLIP, or self-supervised contrastive learning. These objectives do not model spatial information, which can be suboptimal when finetuning on downstream tasks that require spatial reasoning. In this work, we propose to pretrain networks for semantic segmentation by predicting the relative location of image parts. We formulate this task as a classification problem in which each patch in a query view must predict its position relative to a reference view. We control the difficulty of the task by masking a subset of the reference patch features visible to the query. Our experiments show that this location-aware (LOCA) self-supervised pretraining yields representations that transfer competitively to several challenging semantic segmentation benchmarks.
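The pretext task described above can be illustrated with a toy sketch. This is not the paper's implementation; all names, shapes, and the similarity-based "classifier" are assumptions made for illustration. A reference view is split into a grid of patches, each defining one position class; a query view (here, a crop of the reference) must predict, per patch, which reference position it came from, and masking reference patches controls how much context the query can match against.

```python
import numpy as np

# Hypothetical sketch of relative-location prediction (illustrative only;
# shapes, names, and the matching rule are assumptions, not the paper's code).
rng = np.random.default_rng(0)

grid = 4                       # reference view split into grid x grid patches
num_ref = grid * grid          # position classes 0..15, one per reference patch
dim = 8                        # toy feature dimension

ref_feats = rng.normal(size=(num_ref, dim))       # reference patch features

# A query view is a crop of the reference: patches at positions 5, 6, 9, 10.
query_positions = np.array([5, 6, 9, 10])         # ground-truth position labels
query_feats = ref_feats[query_positions] + 0.1 * rng.normal(size=(4, dim))

# Task difficulty is controlled by masking a subset of the reference patch
# features visible to the query; masked patches cannot be matched against.
visible = rng.random(num_ref) < 0.5               # True = visible to the query
visible_idx = np.flatnonzero(visible)

# Each query patch "predicts" its position via similarity to the visible
# reference patches (a stand-in for the learned classification head).
logits = query_feats @ ref_feats[visible_idx].T   # shape (4, n_visible)
pred = visible_idx[logits.argmax(axis=1)]         # predicted position classes

# In training this would be a cross-entropy loss over position classes; here
# we just measure how many query patches recover their relative location.
accuracy = float((pred == query_positions).mean())
print(accuracy)
```

Note that the masking directly trades off difficulty: when a query patch's true reference position is masked out, the patch cannot be matched exactly and must rely on nearby visible context, which is what makes the pretext task non-trivial.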