Existing deep methods produce highly accurate 3D reconstructions in stereo and multiview stereo settings, i.e., when cameras are both internally and externally calibrated. Nevertheless, simultaneously recovering camera poses and 3D scene structure in multiview settings with deep networks remains an open challenge. Inspired by projective factorization for Structure from Motion (SFM) and by deep matrix completion techniques, we propose a neural network architecture that, given a set of point tracks in multiple images of a static scene, recovers both the camera parameters and a (sparse) scene structure by minimizing an unsupervised reprojection loss. Our network architecture is designed to respect the structure of the problem: the sought output is equivariant to permutations of both cameras and scene points. Notably, our method does not require initialization of camera parameters or 3D point locations. We test our architecture in two setups: (1) single scene reconstruction and (2) learning from multiple scenes. Our experiments, conducted on a variety of datasets in both internally calibrated and uncalibrated settings, indicate that our method accurately recovers pose and structure, on par with classical state-of-the-art methods. Additionally, we show that a pre-trained network can be used to reconstruct novel scenes using inexpensive fine-tuning with no loss of accuracy.
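To make the equivariance property concrete, the following is a minimal sketch of one linear layer that is equivariant to permutations of both cameras (rows) and scene points (columns), in the spirit of deep matrix completion layers. It is an illustration under simplifying assumptions, not the architecture of this work: the input is assumed to be a dense tensor of shape (cameras, points, features), whereas real point tracks are sparse, and the names `equivariant_layer`, `W_elem`, `W_row`, `W_col`, `W_all`, and `b` are hypothetical.

```python
import numpy as np

def equivariant_layer(X, W_elem, W_row, W_col, W_all, b):
    """Linear layer equivariant to row (camera) and column (point) permutations.

    X      : array of shape (m, n, d_in), m cameras by n points (assumed dense).
    W_*    : weight matrices of shape (d_in, d_out) (hypothetical names).
    b      : bias of shape (d_out,).
    Returns an array of shape (m, n, d_out).
    """
    # Pool over points: one feature vector per camera.
    row_mean = X.mean(axis=1, keepdims=True)        # (m, 1, d_in)
    # Pool over cameras: one feature vector per point.
    col_mean = X.mean(axis=0, keepdims=True)        # (1, n, d_in)
    # Pool over everything: one global feature vector.
    all_mean = X.mean(axis=(0, 1), keepdims=True)   # (1, 1, d_in)
    # Each term is mapped by its own weights, then broadcast back to (m, n, d_out).
    return (X @ W_elem + row_mean @ W_row
            + col_mean @ W_col + all_mean @ W_all + b)

# Small demonstration with random data (sizes are arbitrary).
rng = np.random.default_rng(0)
m, n, d_in, d_out = 4, 6, 2, 3
X = rng.normal(size=(m, n, d_in))
weights = [rng.normal(size=(d_in, d_out)) for _ in range(4)]
bias = rng.normal(size=d_out)

cam_perm = rng.permutation(m)   # reorder cameras
pt_perm = rng.permutation(n)    # reorder points

out = equivariant_layer(X, *weights, bias)
out_of_permuted = equivariant_layer(X[cam_perm][:, pt_perm], *weights, bias)

# Permuting the input permutes the output in the same way.
is_equivariant = np.allclose(out_of_permuted, out[cam_perm][:, pt_perm])
```

Because the only interactions across rows and columns go through means, reordering the cameras or the points reorders the output identically, which is the property the abstract asks of the sought output. Handling missing track entries would require pooling only over observed entries, which this dense sketch omits.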