Our goal is to learn a deep network that, given a small number of images of an object of a given category, reconstructs it in 3D. While several recent works have obtained analogous results using synthetic data or assuming the availability of 2D primitives such as keypoints, we are interested in working with challenging real data and with no manual annotations. We thus focus on learning a model from multiple views of a large collection of object instances. We contribute a new large-scale dataset of object-centric videos suitable for training and benchmarking this class of models. We show that existing techniques leveraging meshes, voxels, or implicit surfaces, which work well for reconstructing isolated objects, fail on this challenging data. Finally, we propose a new neural network design, called warp-conditioned ray embedding (WCR), which significantly improves reconstruction while obtaining a detailed implicit representation of the object surface and texture, also compensating for the noise in the initial SfM reconstruction that bootstrapped the learning process. Our evaluation demonstrates performance improvements over several deep monocular reconstruction baselines on existing benchmarks and on our novel dataset.