We present a new self-supervised pre-training method for Vision Transformers, targeted at dense prediction tasks. It is based on a contrastive loss across views that compares pixel-level representations to global image representations. This strategy produces local features better suited to dense prediction tasks than those obtained from contrastive pre-training based on global image representations alone. Furthermore, our approach is robust to reduced batch sizes, since the number of negative examples available to the contrastive loss is on the order of the number of local features. We demonstrate the effectiveness of our pre-training strategy on two dense prediction tasks: semantic segmentation and monocular depth estimation.
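For concreteness, below is a minimal sketch of a pixel-to-global contrastive (InfoNCE-style) loss of the kind described above, written in PyTorch. This is not the paper's implementation: the function name `global_to_local_infonce`, the tensor shapes, the single (asymmetric) matching direction, and the temperature value are all illustrative assumptions. The sketch does show why the negative pool scales with the number of local features rather than the batch size.

```python
import torch
import torch.nn.functional as F

def global_to_local_infonce(global_feats, local_feats, temperature=0.1):
    """Illustrative contrastive loss matching each image's global
    representation (from one view) to its pixel-level features
    (from another view). Hypothetical sketch, not the paper's code.

    global_feats: (B, D)    -- one global embedding per image
    local_feats:  (B, N, D) -- N pixel-level embeddings per image
    """
    B, N, D = local_feats.shape
    g = F.normalize(global_feats, dim=-1)                    # (B, D)
    l = F.normalize(local_feats.reshape(B * N, D), dim=-1)   # (B*N, D)

    # Similarity of every global anchor to every local feature in the batch.
    logits = g @ l.t() / temperature                         # (B, B*N)
    log_prob = F.log_softmax(logits, dim=-1)

    # Positives for anchor i: the N local features of image i
    # (columns i*N .. i*N+N-1). The remaining (B-1)*N local features
    # act as negatives, so the negative pool grows with the number of
    # local features, not with the batch size.
    pos_cols = torch.arange(B * N, device=g.device).reshape(B, N)
    return -log_prob.gather(1, pos_cols).mean()

if __name__ == "__main__":
    g = torch.randn(4, 128)       # hypothetical: batch of 4 global embeddings
    l = torch.randn(4, 196, 128)  # hypothetical: 14x14 = 196 local embeddings each
    print(global_to_local_infonce(g, l))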