Monocular scene reconstruction from posed images is challenging due to the complexity of large environments. Recent volumetric methods learn to directly predict the TSDF volume and have demonstrated promising results on this task. However, most methods focus on how to extract and fuse 2D features into a 3D feature volume, and none of them improve how the 3D volume itself is aggregated. In this work, we propose an SDF transformer network, which replaces the 3D CNN for better 3D feature aggregation. To reduce the explosive computational complexity of 3D multi-head attention, we propose a sparse window attention module, where attention is computed only between the non-empty voxels within a local window. A top-down-bottom-up 3D attention network is then built for 3D feature aggregation, in which a dilate-attention structure is proposed to prevent geometry degeneration, and two global modules are employed to provide global receptive fields. Experiments on multiple datasets show that this 3D transformer network generates more accurate and complete reconstructions, outperforming previous methods by a large margin. Remarkably, mesh accuracy is improved by 41.8% and mesh completeness by 25.3% on the ScanNet dataset. Project page: https://weihaosky.github.io/sdfformer.
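The key idea of the sparse window attention module can be illustrated with a minimal sketch: the voxel grid is partitioned into local windows, and multi-head self-attention is computed only among the non-empty voxels inside each window, so empty space contributes no computation. The sketch below is a toy illustration of this idea, not the paper's implementation; all function and variable names are hypothetical, and for brevity the queries, keys, and values share one projection-free tensor.

```python
import torch

def sparse_window_attention(feat, occupancy, window=4, num_heads=2):
    """Toy sparse window attention (hypothetical sketch, not the paper's code).

    feat:      (X, Y, Z, C) dense feature volume
    occupancy: (X, Y, Z) bool mask of non-empty voxels
    Attention is computed only among non-empty voxels sharing a local window.
    """
    X, Y, Z, C = feat.shape
    head_dim = C // num_heads
    idx = occupancy.nonzero(as_tuple=False)          # (N, 3) coords of non-empty voxels
    win_id = idx // window                           # window index of each voxel
    # Hash the 3D window index to a single integer key per voxel
    key = win_id[:, 0] * 10**6 + win_id[:, 1] * 10**3 + win_id[:, 2]

    out = feat.clone()                               # empty voxels pass through unchanged
    for k in key.unique():
        sel = idx[key == k]                          # voxels belonging to this window
        x = feat[sel[:, 0], sel[:, 1], sel[:, 2]]    # (n, C) features in the window
        # Simplified attention: q = k = v = x (no learned projections)
        q = x.view(-1, num_heads, head_dim).transpose(0, 1)          # (H, n, d)
        attn = torch.softmax(q @ q.transpose(-1, -2) / head_dim**0.5, dim=-1)
        y = (attn @ q).transpose(0, 1).reshape(-1, C)                # (n, C)
        out[sel[:, 0], sel[:, 1], sel[:, 2]] = y
    return out
```

Because the softmax is taken only over the `n` occupied voxels in each window, the cost scales with the number of non-empty voxels rather than with the full `X*Y*Z` volume, which is what makes 3D attention tractable here.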