Self-supervised representation learning based on Contrastive Learning (CL) has attracted considerable attention in recent years, owing to the excellent results it obtains on a variety of downstream tasks (in particular classification) without requiring a large number of labeled samples. However, most reference CL algorithms (such as SimCLR and MoCo, but also BYOL and Barlow Twins) are not well suited to pixel-level downstream tasks. One existing solution, PixPro, proposes a pixel-level approach that selects positive/negative pairs from two crops of the same image according to the distance between them in the full image. We argue that this idea can be further enhanced by incorporating semantic information provided by exogenous data as an additional selection filter, used at training time to improve the selection of pixel-level positive/negative samples. In this paper we focus on depth information, which can be obtained from a depth estimation network or measured from available data (stereovision, motion parallax, LiDAR, etc.). Scene depth provides meaningful cues for distinguishing pixels that belong to different objects. We show that using this exogenous information in the contrastive loss leads to improved results and that the learned representations follow object shapes more closely. In addition, we introduce a multi-scale loss that alleviates the difficulty of finding training parameters suited to different object sizes. We demonstrate the effectiveness of our ideas on breakout segmentation in borehole images, where we achieve an improvement of 1.9\% over PixPro and nearly 5\% over the supervised baseline. We further validate our technique on indoor scene segmentation with ScanNet and outdoor scene segmentation with CityScapes (1.6\% and 1.1\% improvement over PixPro, respectively).
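
The depth-based selection filter described above can be illustrated with a minimal sketch. The abstract does not give the exact selection rule, so everything here is an assumption made for illustration: the function name `positive_pair_mask`, the thresholds `dist_thresh` and `depth_thresh`, and the rule itself (a pair is positive only when the two pixels are close in the full image and also have similar depth).

```python
import numpy as np


def positive_pair_mask(coords_a, coords_b, depth_a, depth_b,
                       dist_thresh=0.7, depth_thresh=0.1):
    """Return a boolean (N, M) mask of assumed positive pixel pairs.

    coords_a: (N, 2) positions of view-A pixels in the full-image frame.
    coords_b: (M, 2) positions of view-B pixels in the same frame.
    depth_a:  (N,) depth value for each view-A pixel.
    depth_b:  (M,) depth value for each view-B pixel.
    Both thresholds are hypothetical values, not taken from the paper.
    """
    coords_a = np.asarray(coords_a, dtype=float)
    coords_b = np.asarray(coords_b, dtype=float)
    depth_a = np.asarray(depth_a, dtype=float)
    depth_b = np.asarray(depth_b, dtype=float)

    # Pairwise Euclidean distance between pixel positions (spatial filter).
    diff = coords_a[:, None, :] - coords_b[None, :, :]
    spatial_dist = np.linalg.norm(diff, axis=-1)

    # Pairwise absolute depth difference (assumed additional depth filter).
    depth_gap = np.abs(depth_a[:, None] - depth_b[None, :])

    # A pair counts as positive only if it passes both filters.
    return (spatial_dist < dist_thresh) & (depth_gap < depth_thresh)


if __name__ == "__main__":
    coords_a = [[0.0, 0.0], [1.0, 0.0]]
    coords_b = [[0.0, 0.5], [5.0, 5.0]]
    mask = positive_pair_mask(coords_a, coords_b,
                              depth_a=[1.0, 1.0], depth_b=[1.05, 1.0])
    print(mask)
```

In this sketch, pairs rejected by either filter are treated as negatives; a contrastive loss would then pull together the features of pairs where the mask is true and push apart the rest. How the paper actually combines the two filters, and how it weights them in the loss, may differ.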