In this paper we revisit Semantic Scene Completion (SSC), the task of jointly predicting the semantic and occupancy representation of a 3D scene. Most existing methods for this task rely on voxelized scene representations to preserve local scene structure. However, because a large fraction of the voxels are visibly empty, these methods suffer from heavy computational redundancy as the network goes deeper, which in turn limits completion quality. To address this dilemma, we propose a novel point-voxel aggregation network. First, we convert the voxelized scene into a point cloud by removing the visible empty voxels, and adopt a deep point stream to capture semantic information from the scene efficiently. Meanwhile, a light-weight voxel stream containing only two 3D convolution layers preserves the local structure of the voxelized scene. Furthermore, we design an anisotropic voxel aggregation operator that fuses structural details from the voxel stream into the point stream, and a semantic-aware propagation module that enhances the up-sampling process in the point stream guided by semantic labels. We demonstrate that our model surpasses state-of-the-art methods on two benchmarks by a large margin, using only depth images as input.
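To make the first step concrete, below is a minimal sketch (not the authors' code) of the voxel-to-point conversion. It assumes the voxelized scene is a dense occupancy grid in which visible empty voxels carry a known label (hypothetically 0 here) and per-voxel input features are available; the surviving voxel centers become the point cloud:

```python
import torch

def voxels_to_points(occupancy: torch.Tensor, feats: torch.Tensor):
    """Drop visible empty voxels; keep the rest as a point cloud.

    occupancy: (D, H, W) integer grid; 0 marks a visible empty voxel (assumed).
    feats:     (C, D, H, W) per-voxel input features.
    Returns (N, 3) point coordinates and (N, C) point features.
    """
    mask = occupancy != 0                  # keep occupied and occluded voxels
    coords = mask.nonzero(as_tuple=False)  # (N, 3) integer voxel indices
    points = coords.float()                # voxel centers serve as point coordinates
    point_feats = feats[:, mask].t()       # (N, C) features at the kept voxels
    return points, point_feats
```

The deep point stream (e.g., a PointNet++-style encoder-decoder; the backbone is not specified in this abstract) then operates only on these N retained points, which is where the savings over running dense 3D convolutions on the full grid come from.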
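The abstract does not define the anisotropic voxel aggregation operator itself, so the sketch below substitutes a plain (isotropic) trilinear feature gather purely to illustrate the data flow: a light-weight voxel stream of two 3D convolutions produces a feature volume, and each point pulls structural features from it at its voxel-space location. Channel sizes and kernel shapes are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed two-layer voxel stream; the paper states only "two 3D convolution layers".
voxel_stream = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)

def gather_voxel_features(voxel_feats: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """Trilinearly sample a feature volume at point locations (isotropic stand-in).

    voxel_feats: (1, C, D, H, W) feature volume from the voxel stream.
    points:      (N, 3) voxel-space coordinates in (z, y, x) index order.
    Returns (N, C) per-point structure features to fuse into the point stream.
    """
    _, c, d, h, w = voxel_feats.shape
    scale = points.new_tensor([d - 1, h - 1, w - 1])
    # Normalize to [-1, 1] and flip (z, y, x) -> (x, y, z) as grid_sample expects.
    grid = (points / scale * 2 - 1).flip(-1).view(1, -1, 1, 1, 3)
    sampled = F.grid_sample(voxel_feats, grid, align_corners=True)  # (1, C, N, 1, 1)
    return sampled.view(c, -1).t()                                  # (N, C)
```

In the paper's design the aggregation is anisotropic, i.e., it treats neighboring voxels differently along different axes; the trilinear lookup above only shows where such an operator sits in the two-stream architecture.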