We introduce TransformerFusion, a transformer-based 3D scene reconstruction approach. From an input monocular RGB video, the video frames are processed by a transformer network that fuses the observations into a volumetric feature grid representing the scene; this feature grid is then decoded into an implicit 3D scene representation. Key to our approach is the transformer architecture, which enables the network to learn to attend to the most relevant image frames for each 3D location in the scene, supervised only by the scene reconstruction task. Features are fused in a coarse-to-fine fashion, storing fine-level features only where needed, which reduces memory consumption and enables fusion at interactive rates. The feature grid is then decoded to a higher-resolution scene reconstruction, using an MLP-based surface occupancy prediction from interpolated coarse-to-fine 3D features. Our approach produces accurate surface reconstructions, outperforming state-of-the-art multi-view stereo depth estimation methods, fully-convolutional 3D reconstruction approaches, and approaches using LSTM- or GRU-based recurrent networks for video sequence fusion.
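To make the two core ideas concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a per-voxel attention module that fuses features sampled from multiple frames (standing in for the learned frame attention), and an MLP that decodes concatenated coarse and fine voxel features into a surface occupancy value. The module names (`VoxelFrameAttention`, `OccupancyMLP`), feature dimensions, and layer sizes are illustrative assumptions.

```python
# Hypothetical sketch of TransformerFusion's two key components;
# names, dimensions, and layer sizes are assumptions, not the paper's code.
import torch
import torch.nn as nn


class VoxelFrameAttention(nn.Module):
    """For each 3D grid location, attend over per-frame image features
    so the network can weight the most relevant observations."""

    def __init__(self, feat_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # A learned query shared across voxels (illustrative choice).
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_voxels, num_frames, feat_dim), i.e. image
        # features sampled at each voxel's projection into each frame.
        q = self.query.expand(frame_feats.shape[0], -1, -1)
        fused, _ = self.attn(q, frame_feats, frame_feats)
        return fused.squeeze(1)  # (num_voxels, feat_dim) fused feature


class OccupancyMLP(nn.Module):
    """Decode interpolated coarse + fine voxel features into a
    surface occupancy logit at a query point."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),  # occupancy logit
        )

    def forward(self, coarse_feat: torch.Tensor, fine_feat: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([coarse_feat, fine_feat], dim=-1))


# Usage sketch: fuse 8 frame observations for 1024 voxels, then decode.
fusion = VoxelFrameAttention()
decoder = OccupancyMLP()
frame_feats = torch.randn(1024, 8, 64)   # sampled per-frame features
fine = fusion(frame_feats)               # attention-fused fine features
coarse = torch.randn(1024, 64)           # trilinearly interpolated coarse features
occupancy = decoder(coarse, fine)        # (1024, 1) occupancy logits
```

In the full method, fine-level features would only be computed and stored near the surface, which is what keeps memory low and fusion interactive; the dense usage above is purely for illustration.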