Self-Supervised Learning (SSL) methods operate on unlabeled data to learn robust representations useful for downstream tasks. Most SSL methods rely on augmentations obtained by transforming the 2D image pixel map. These augmentations ignore the fact that biological vision takes place in an immersive three-dimensional, temporally contiguous environment, and that low-level biological vision relies heavily on depth cues. Using a signal provided by a pretrained state-of-the-art monocular RGB-to-depth model (the \emph{Dense Prediction Transformer}; Ranftl et al., 2021), we explore two distinct approaches to incorporating depth signals into the SSL framework. First, we evaluate contrastive learning using an RGB+depth input representation. Second, we use the depth signal to generate novel views from slightly different camera positions, thereby producing a 3D augmentation for contrastive learning. We evaluate these two approaches on three different SSL methods -- BYOL, SimSiam, and SwAV -- using the ImageNette (a 10-class subset of ImageNet), ImageNet-100, and ImageNet-1k datasets. We find that both approaches to incorporating depth signals improve the robustness and generalization of the baseline SSL methods, though the first approach (depth-channel concatenation) is superior. For instance, BYOL with the additional depth channel increases downstream classification accuracy from 85.3\% to 88.0\% on ImageNette and from 84.1\% to 87.0\% on ImageNet-C.
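The first approach above can be sketched as a simple input-pipeline change: predict a depth map from the RGB image with a pretrained monocular model, normalize it, and stack it as a fourth channel before feeding the view to the SSL encoder. A minimal sketch in NumPy, where `predict_depth` is a hypothetical stand-in for the pretrained depth network (the actual paper uses the Dense Prediction Transformer):

```python
import numpy as np

def predict_depth(rgb):
    """Stand-in for a pretrained monocular depth model (e.g. DPT).

    Hypothetical placeholder: a real pipeline would run the pretrained
    network here. We return a dummy (H, W) map of the correct shape so
    the surrounding plumbing can be exercised end to end.
    """
    return rgb.mean(axis=-1)

def rgbd_input(rgb):
    """Concatenate a normalized depth map as a fourth channel (approach 1)."""
    depth = predict_depth(rgb)                                  # (H, W)
    depth = (depth - depth.min()) / (np.ptp(depth) + 1e-8)      # scale to [0, 1]
    return np.concatenate([rgb, depth[..., None]], axis=-1)    # (H, W, 4)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
x = rgbd_input(img)
print(x.shape)  # (224, 224, 4)
```

In practice the encoder's first convolution must also be widened to accept 4 input channels; the contrastive objective itself (BYOL, SimSiam, or SwAV) is unchanged.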