We describe a data-driven method for inferring the camera viewpoints given multiple images of an arbitrary object. This task is a core component of classic geometric pipelines such as SfM and SLAM, and also serves as a vital pre-processing requirement for contemporary neural approaches (e.g. NeRF) to object reconstruction and view synthesis. In contrast to existing correspondence-driven methods, which do not perform well given sparse views, we propose a top-down prediction-based approach for estimating camera viewpoints. Our key technical insight is the use of an energy-based formulation for representing distributions over relative camera rotations, thus allowing us to explicitly represent multiple camera modes arising from object symmetries or views. Leveraging these relative predictions, we jointly estimate a consistent set of camera rotations from multiple images. We show that our approach outperforms state-of-the-art SfM and SLAM methods given sparse images on both seen and unseen categories. Further, our probabilistic approach significantly outperforms directly regressing relative poses, suggesting that modeling multimodality is important for coherent joint reconstruction. We demonstrate that our system can be a stepping stone toward in-the-wild reconstruction from multi-view datasets. The project page with code and videos can be found at https://jasonyzhang.com/relpose.
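To make the energy-based formulation concrete, the sketch below shows one way such a model could be structured: a small network scores (image-pair feature, candidate rotation) pairs, and a softmax over sampled rotation hypotheses yields a distribution that can place mass on multiple modes (e.g. for symmetric objects). This is a minimal illustration, not the paper's implementation; the class name `RelativeRotationEnergy`, the MLP architecture, the feature dimension, and the number of sampled rotations are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class RelativeRotationEnergy(nn.Module):
    """Scores (image-pair feature, candidate rotation) pairs; a softmax over
    sampled rotations turns the scores into an approximate distribution.
    Hypothetical sketch, not the authors' architecture."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 9, 256),  # 9 = flattened 3x3 rotation matrix
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, pair_feat: torch.Tensor, rotations: torch.Tensor) -> torch.Tensor:
        # pair_feat: (B, feat_dim) features of an image pair (e.g. from a CNN backbone)
        # rotations: (B, N, 3, 3) sampled relative-rotation hypotheses
        B, N = rotations.shape[:2]
        feat = pair_feat.unsqueeze(1).expand(B, N, -1)
        scores = self.mlp(torch.cat([feat, rotations.reshape(B, N, 9)], dim=-1))
        return scores.squeeze(-1)  # (B, N) unnormalized log-probabilities


# Toy usage: the resulting distribution over candidate rotations can be
# multimodal, which is the property the abstract highlights.
model = RelativeRotationEnergy()
pair_feat = torch.randn(2, 512)        # stand-in for extracted pair features
rotations = torch.randn(2, 64, 3, 3)   # stand-in for sampled rotation hypotheses
probs = torch.softmax(model(pair_feat, rotations), dim=-1)  # approx. p(R | I1, I2)
```

In such a setup, the joint estimation step could then search for a set of per-image rotations whose pairwise relative rotations score highly under these distributions, though the specific optimization used in the paper is not shown here.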