Multi-modal fusion has been shown to enhance performance on scene classification tasks. This paper presents a 2D-3D Fusion stage that combines 3D Geometric Features with 2D Texture Features obtained by 2D Convolutional Neural Networks. To obtain a robust 3D Geometric embedding, a network that uses two novel layers is proposed. The first layer, Multi-Neighbourhood Graph Convolution, aims to learn a more robust geometric descriptor of the scene by combining two different neighbourhoods: one in Euclidean space and the other in feature space. The second proposed layer, Nearest Voxel Pooling, improves on the performance of the well-known Voxel Pooling. Experimental results on the NYU-Depth-V2 and SUN RGB-D datasets show that the proposed method outperforms the current state of the art on the RGB-D indoor scene classification task.
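The dual-neighbourhood idea behind Multi-Neighbourhood Graph Convolution can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `knn_indices` and `multi_neighbourhood_aggregate` are hypothetical, and a simple mean over the union of the two k-nearest-neighbour sets stands in for the learned convolution.

```python
import numpy as np

def knn_indices(ref, k):
    # Pairwise squared distances, then the k nearest points (excluding self).
    d = ((ref[:, None, :] - ref[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def multi_neighbourhood_aggregate(xyz, feats, k=4):
    """Aggregate each point's features over the union of its
    Euclidean-space and feature-space k-nearest neighbourhoods.
    Mean aggregation is a placeholder for the learned convolution."""
    nn_euc = knn_indices(xyz, k)    # neighbours by 3D position
    nn_feat = knn_indices(feats, k) # neighbours by feature similarity
    out = np.empty_like(feats)
    for i in range(len(xyz)):
        nbrs = np.union1d(nn_euc[i], nn_feat[i])
        out[i] = feats[nbrs].mean(axis=0)
    return out
```

The union of the two neighbour sets lets each point mix information from spatially close points and from semantically similar but possibly distant points, which is the intuition the abstract attributes to the layer.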