We propose VisFusion, a visibility-aware online 3D scene reconstruction approach from posed monocular videos. In particular, we aim to reconstruct the scene from volumetric features. Unlike previous reconstruction methods that aggregate features for each voxel from the input views without considering visibility, we improve the feature fusion by explicitly inferring each voxel's visibility from a similarity matrix computed from its projected features in each image pair. Following previous works, our model is a coarse-to-fine pipeline that includes a volume sparsification step. Unlike those works, which sparsify voxels globally with a fixed occupancy threshold, we perform the sparsification on a local feature volume along each visual ray, preserving at least one voxel per ray so that finer details are retained. The sparse local volume is then fused with a global one for online reconstruction. We further propose to predict the TSDF in a coarse-to-fine manner by learning its residuals across scales, which leads to better TSDF predictions. Experimental results on benchmarks show that our method achieves superior performance with more scene details. Code is available at: https://github.com/huiyu-gao/VisFusion
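To make the ray-based sparsification concrete, here is a minimal PyTorch sketch of the idea, not the paper's actual implementation: `sparsify_along_rays`, `occ`, `ray_voxel_idx`, and `thresh` are hypothetical names, and the voxel-to-ray mapping is assumed to be precomputed. The point it illustrates is that a fixed global threshold can delete every voxel along a ray, whereas keeping each ray's best-scoring voxel guarantees every ray still contributes a candidate to the finer level.

```python
import torch

def sparsify_along_rays(occ, ray_voxel_idx, thresh=0.5):
    """Ray-wise volume sparsification (illustrative sketch, not the official code).

    occ:           (V,) predicted occupancy score per voxel.
    ray_voxel_idx: (R, K) indices of the up-to-K voxels each visual ray
                   traverses, padded with -1 for shorter rays.
    Returns a (V,) boolean mask: voxels above `thresh` survive, and every
    ray additionally keeps its highest-scoring voxel, so no ray is emptied
    by the threshold alone.
    """
    keep = occ > thresh                               # plain global thresholding
    valid = ray_voxel_idx >= 0
    # Gather per-ray scores; mask the padding with -inf so it never wins argmax.
    gathered = occ[ray_voxel_idx.clamp(min=0)]
    scores = torch.where(valid, gathered, torch.full_like(gathered, float("-inf")))
    best = scores.argmax(dim=1)                       # best voxel along each ray
    rows = torch.nonzero(valid.any(dim=1), as_tuple=True)[0]
    keep[ray_voxel_idx[rows, best[rows]]] = True      # rescue one voxel per ray
    return keep

# Toy usage: 1000 voxels, 64 rays, each passing through up to 8 voxels.
occ = torch.rand(1000)
ray_voxel_idx = torch.randint(0, 1000, (64, 8))
mask = sparsify_along_rays(occ, ray_voxel_idx)
```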