Self-supervised representation learning for visual pre-training has achieved remarkable success with sample (instance or pixel) discrimination and instance-level semantics discovery, yet a non-negligible gap remains between pre-trained models and downstream dense prediction tasks. Concretely, these downstream tasks require more accurate representations; in other words, pixels from the same object must belong to a shared semantic category, a property lacking in previous methods. In this work, we present Dense Semantic Contrast (DSC) for modeling semantic category decision boundaries at a dense level to meet the requirements of these tasks. Furthermore, we propose a dense cross-image semantic contrastive learning framework for multi-granularity representation learning. Specifically, we explicitly explore the semantic structure of the dataset by mining relations among pixels from different perspectives. For intra-image relation modeling, we discover pixel neighbors from multiple views. For inter-image relations, we enforce pixel representations from the same semantic class to be more similar than representations from different classes within a mini-batch. Experimental results show that our DSC model outperforms state-of-the-art methods when transferring to downstream dense prediction tasks, including object detection, semantic segmentation, and instance segmentation. Code will be made available.