Indoor scene images usually contain scattered objects and varied scene layouts, which makes RGB-D scene classification a challenging task. Existing methods remain limited when classifying scene images with large spatial variability. Thus, how to extract local patch-level features effectively using only image-level labels is still an open problem for RGB-D scene recognition. In this paper, we propose an efficient framework for RGB-D scene recognition that adaptively selects important local features to capture the large spatial variability of scene images. Specifically, we design a differentiable local feature selection (DLFS) module, which extracts an appropriate number of key local scene-related features. With the DLFS module, discriminative local theme-level and object-level representations can be selected from the spatially correlated multi-modal RGB-D features. We take advantage of the correlation between the RGB and depth modalities to provide more cues for selecting local features. To ensure that discriminative local features are selected, we propose a variational mutual information maximization loss. Additionally, the DLFS module can be easily extended to select local features at different scales. By concatenating the local orderless and global structured multi-modal features, the proposed framework achieves state-of-the-art performance on public RGB-D scene recognition datasets.