CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets

Current RGB-D scene recognition approaches often train two standalone backbones for the RGB and depth modalities, both initialized from the same Places or ImageNet pre-training. However, the pre-trained depth network is still biased by RGB-based models, which may result in a suboptimal solution. In this paper, we present a single-model self-supervised hybrid pre-training framework for the RGB and depth modalities, termed CoMAE. CoMAE uses a curriculum learning strategy to unify two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling. Specifically, we first build a patch-level alignment task to pre-train a single encoder shared by the two modalities via cross-modal contrastive learning. Then, the pre-trained contrastive encoder is passed to a multi-modal masked autoencoder to capture finer context features from a generative perspective. In addition, our single-model design requires no fusion module, so it is flexible and generalizes robustly to the unimodal scenario in both the training and testing phases. Extensive experiments on the SUN RGB-D and NYUDv2 datasets demonstrate the effectiveness of CoMAE for RGB and depth representation learning. Our experimental results also reveal that CoMAE is a data-efficient representation learner: although we use only the small-scale, unlabeled training set for pre-training, our CoMAE pre-trained models remain competitive with state-of-the-art methods that rely on extra large-scale, supervised RGB dataset pre-training. Code will be released at https://github.com/MCG-NJU/CoMAE.
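The two stages named in the abstract can be illustrated with a minimal NumPy sketch. It is not the released CoMAE code. The symmetric InfoNCE form, the temperature of 0.07, the mask ratio of 0.75, and the function names `patch_info_nce` and `random_patch_mask` are all assumptions made for illustration. The abstract states only that patch-level cross-modal contrastive learning is followed by multi-modal masked autoencoding.

```python
import numpy as np


def patch_info_nce(rgb, depth, temperature=0.07):
    """Symmetric InfoNCE loss over patch embeddings (illustrative assumption).

    rgb, depth: arrays of shape (N, D). Row i of each array is assumed to be
    the embedding of the same spatial patch, so (rgb[i], depth[i]) is the
    positive pair and every other row acts as a negative.
    """
    # L2-normalize so that dot products are cosine similarities.
    rgb = rgb / np.linalg.norm(rgb, axis=1, keepdims=True)
    depth = depth / np.linalg.norm(depth, axis=1, keepdims=True)
    logits = rgb @ depth.T / temperature  # shape (N, N)

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_prob))

    # Average the RGB-to-depth and depth-to-RGB directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))


def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patch indices; True marks a patch hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(16, 8))  # 16 hypothetical patch embeddings
    print("aligned loss :", patch_info_nce(patches, patches))
    print("shuffled loss:", patch_info_nce(patches, patches[::-1].copy()))
    print("masked count :", random_patch_mask(196, 0.75, rng).sum())
```

In this sketch, well-aligned RGB and depth patch embeddings yield a lower loss than mismatched ones, which is the signal a shared encoder would be trained on in the first stage. The mask function shows only how patches might be hidden before reconstruction in the second stage; the decoder and reconstruction loss are omitted.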
