Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations. Existing SOP methods typically aggregate contextual features to assist occupancy representation learning, alleviating issues such as occlusion and ambiguity. However, these solutions often face misalignment issues, wherein features at the same position across different frames may carry different semantic meanings during aggregation, leading to unreliable contextual fusion and an unstable representation learning process. To address this problem, we introduce a new Hierarchical context alignment paradigm for more accurate SOP (Hi-SOP). Hi-SOP first disentangles the geometric and temporal contexts for separate alignment; the two branches are then composed to enhance the reliability of SOP. This parses the visual input into a local-global alignment hierarchy: (I) separate alignment of the disentangled geometric and temporal contexts, which leverage depth confidence and camera pose as priors for relevant feature matching, respectively; (II) global alignment and composition of the transformed geometric and temporal volumes based on semantic consistency. Our method outperforms state-of-the-art approaches on semantic scene completion on the SemanticKITTI and NuScenes-Occupancy datasets and on LiDAR semantic segmentation on the NuScenes dataset. The project website is available at https://arlo0o.github.io/hisop.github.io/.
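To make the two-stage alignment concrete, the following is a minimal conceptual sketch, not the authors' implementation: it assumes hypothetical tensor shapes and helper names, and stands in for the local geometric/temporal alignment (depth-confidence weighting and pose-based warping) and the global semantic-consistency composition described above.

```python
# Hedged sketch of hierarchical context alignment; all names and shapes are assumptions.
import torch
import torch.nn.functional as F


def local_geometric_alignment(img_feats, depth_conf):
    """Weight a lifted 3D feature volume by per-voxel depth confidence (assumed prior)."""
    # img_feats: (B, C, X, Y, Z) volume lifted from 2D image features
    # depth_conf: (B, 1, X, Y, Z) confidence in [0, 1]
    return img_feats * depth_conf


def local_temporal_alignment(prev_vol, sampling_grid):
    """Warp the previous-frame volume into the current frame with a pose-derived grid."""
    # prev_vol: (B, C, X, Y, Z); sampling_grid: (B, X, Y, Z, 3) normalized coordinates,
    # assumed to be precomputed from the relative camera pose between frames.
    return F.grid_sample(prev_vol, sampling_grid, align_corners=False)


def global_semantic_composition(geo_vol, temp_vol):
    """Fuse the two aligned volumes, weighted by their per-voxel semantic consistency."""
    # Channel-wise cosine similarity serves as a simple consistency score here.
    consistency = F.cosine_similarity(geo_vol, temp_vol, dim=1, eps=1e-6).unsqueeze(1)
    w = torch.sigmoid(consistency)  # higher consistency -> more temporal context retained
    return w * temp_vol + (1.0 - w) * geo_vol


if __name__ == "__main__":
    B, C, X, Y, Z = 1, 8, 16, 16, 4
    img_feats = torch.randn(B, C, X, Y, Z)
    depth_conf = torch.rand(B, 1, X, Y, Z)
    prev_vol = torch.randn(B, C, X, Y, Z)
    # Identity sampling grid as a stand-in for a real pose-derived warp.
    grid = F.affine_grid(torch.eye(3, 4).unsqueeze(0), size=(B, C, X, Y, Z),
                         align_corners=False)
    geo = local_geometric_alignment(img_feats, depth_conf)
    tmp = local_temporal_alignment(prev_vol, grid)
    fused = global_semantic_composition(geo, tmp)
    print(fused.shape)  # torch.Size([1, 8, 16, 16, 4])
```

In this sketch the fusion weight is a per-voxel sigmoid of the cosine similarity between the two aligned volumes; the paper's actual composition based on semantic consistency may differ.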