Recent work in self-supervised learning has demonstrated strong performance on scene-level dense prediction tasks by pretraining with object-centric or region-based correspondence objectives. In this paper, we present Region-to-Object Representation Learning (R2O), which unifies region-based and object-centric pretraining. R2O trains an encoder to dynamically refine region-based segments into object-centric masks while jointly learning representations of the contents within each mask. R2O uses a "region refinement module" that clusters region-level features to group small image regions, generated using a region-level prior, into larger regions that tend to correspond to objects. As pretraining progresses, R2O follows a region-to-object curriculum that encourages learning region-level features early on and gradually shifts toward training object-centric representations. Representations learned using R2O lead to state-of-the-art performance in semantic segmentation on PASCAL VOC (+0.7 mIoU) and Cityscapes (+0.4 mIoU) and in instance segmentation on MS COCO (+0.3 mask AP). Further, after pretraining on ImageNet, R2O-pretrained models surpass the existing state of the art in unsupervised object segmentation on the Caltech-UCSD Birds 200-2011 dataset (+2.9 mIoU) without any further training. We provide the code and models from this work at https://github.com/KKallidromitis/r2o.
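To make the region refinement idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes plain k-means as the clustering step and a linear schedule for the number of clusters, neither of which is specified in the abstract. The function names (pool_region_features, kmeans, refine_regions, num_clusters_schedule) and the default values k_start=16 and k_end=4 are hypothetical. In R2O the features would come from the encoder being trained; here they are simply arrays passed in by the caller.

```python
import numpy as np

def pool_region_features(features, region_ids, num_regions):
    """Average per-pixel features inside each small region.

    features: (H, W, C) float array; region_ids: (H, W) ints in [0, num_regions).
    """
    flat_feats = features.reshape(-1, features.shape[-1])
    flat_ids = region_ids.reshape(-1)
    sums = np.zeros((num_regions, flat_feats.shape[1]))
    np.add.at(sums, flat_ids, flat_feats)           # per-region feature sums
    counts = np.bincount(flat_ids, minlength=num_regions)
    return sums / np.maximum(counts, 1)[:, None]    # guard against empty regions

def kmeans(points, k, iters=10, seed=0):
    """Plain k-means (an assumed stand-in for the clustering step)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:                    # keep old center if cluster is empty
                centers[j] = members.mean(axis=0)
    return labels

def refine_regions(features, region_ids, num_regions, k):
    """Merge small regions into k larger masks by clustering region features."""
    region_feats = pool_region_features(features, region_ids, num_regions)
    region_to_cluster = kmeans(region_feats, k)
    return region_to_cluster[region_ids]            # (H, W) refined mask ids

def num_clusters_schedule(step, total_steps, k_start=16, k_end=4):
    """Assumed linear curriculum: many small regions early, fewer large masks late."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return int(round(k_start + frac * (k_end - k_start)))
```

Under these assumptions, a training loop would call num_clusters_schedule at each step to pick k and then refine_regions to produce the masks used by the representation objective, so that the masks coarsen from region-like to object-like as pretraining progresses.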