Scene recognition is a fundamental problem in computer vision with extensive applications in robotics. When available, depth images provide helpful geometric cues that complement RGB texture information and support the identification of discriminative scene features. Depth sensing technology has advanced rapidly in recent years, and a wide variety of 3D cameras have been introduced, each with different acquisition properties. However, these properties are often neglected when assembling large data collections, and multi-modal images are gathered with no regard for their origin. In this work, we put under the spotlight the existence of a possibly severe domain shift issue within multi-modality scene recognition datasets. As a consequence, a scene classification model trained on data from one camera may fail to generalize to data from a different camera, yielding low recognition performance. Starting from the well-known SUN RGB-D dataset, we design an experimental testbed to study this problem and use it to benchmark the performance of existing methods. Finally, we introduce a novel adaptive scene recognition approach that leverages self-supervised translation between modalities. Indeed, learning to map RGB to depth and vice versa is an unsupervised procedure that can be trained jointly on data from multiple cameras and may help to bridge the gap between the extracted feature distributions. Our experimental results confirm the effectiveness of the proposed approach.
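To make the self-supervised cross-modal translation idea concrete, the following is a minimal sketch, not the paper's actual method: the abstract does not specify architectures or losses, so the tiny encoder-decoders, the L1 reconstruction objective, and the training loop below are illustrative assumptions. The key property it demonstrates is that the RGB-to-depth and depth-to-RGB mappings need no scene labels, so pairs from every camera can be mixed into the same training stream.

```python
# Illustrative sketch only: architectures and losses are assumptions,
# not the method described in the paper.
import torch
import torch.nn as nn

def translator(in_ch: int, out_ch: int) -> nn.Sequential:
    """Tiny fully convolutional encoder-decoder (stand-in architecture)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1),
    )

rgb2depth = translator(3, 1)   # RGB image -> depth map
depth2rgb = translator(1, 3)   # depth map -> RGB image
opt = torch.optim.Adam(
    list(rgb2depth.parameters()) + list(depth2rgb.parameters()), lr=1e-4
)
l1 = nn.L1Loss()

def train_step(rgb: torch.Tensor, depth: torch.Tensor) -> float:
    """One self-supervised step on a batch of RGB-D pairs from any camera.

    No scene labels are used: each modality serves as the target for the
    other, so batches from different cameras can be trained jointly.
    """
    opt.zero_grad()
    loss = l1(rgb2depth(rgb), depth) + l1(depth2rgb(depth), rgb)
    loss.backward()
    opt.step()
    return loss.item()

# Random stand-in data in place of RGB-D pairs gathered from multiple cameras:
for _ in range(2):
    rgb = torch.rand(4, 3, 64, 64)    # batch of RGB images
    depth = torch.rand(4, 1, 64, 64)  # corresponding depth maps
    train_step(rgb, depth)
```

In a setup like this, the encoder features learned by the translation networks could then be reused by the downstream scene classifier, which is one plausible way the shared cross-camera training might narrow the gap between feature distributions.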