Learning-based approaches have become indispensable for camera pose estimation. However, feature detection, description, matching, and pose optimization are often approached in isolation. In particular, erroneous feature matches have a severe impact on subsequent camera pose estimation and often require additional measures such as outlier rejection. Our method tackles this challenge by addressing feature matching and pose optimization jointly. First, we integrate information from multiple views into the matching by spanning a graph attention network across multiple frames to predict their matches all at once. Second, the resulting matches, along with their predicted confidences, are used for robust pose optimization with a differentiable Gauss-Newton solver. End-to-end training combined with multi-view feature matching boosts the pose estimation metrics compared to SuperGlue by 8.9% on ScanNet and 10.7% on MegaDepth on average. Our approach improves both pose estimation and matching accuracy over state-of-the-art matching networks. Training feature matching across multiple views with gradients from pose optimization naturally teaches the network to disregard outliers, rendering additional outlier handling unnecessary, which is highly desirable for pose estimation systems.
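To make the role of the predicted confidences concrete, the following is a minimal sketch of a confidence-weighted Gauss-Newton solve. It is a toy 2D analog under stated assumptions, not the solver described above: the pose is a planar rotation and translation `(theta, tx, ty)`, the residual is the point-to-point alignment error, and the names `solve3` and `weighted_gauss_newton` are hypothetical. The abstract does not specify the actual residual or pose parameterization. The point illustrated is only that a match with near-zero confidence contributes almost nothing to the normal equations, so no separate outlier rejection step is needed.

```python
import math

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / M[r][r]
    return x

def weighted_gauss_newton(src, dst, conf, iters=10):
    """Minimize sum_i conf_i * ||R(theta) p_i + t - q_i||^2 over (theta, tx, ty)."""
    theta, tx, ty = 0.0, 0.0, 0.0
    for _ in range(iters):
        c, s = math.cos(theta), math.sin(theta)
        H = [[0.0] * 3 for _ in range(3)]  # weighted J^T J
        g = [0.0] * 3                      # weighted J^T r
        for (px, py), (qx, qy), w in zip(src, dst, conf):
            rx = c * px - s * py + tx - qx
            ry = s * px + c * py + ty - qy
            # Jacobian rows of (rx, ry) with respect to (theta, tx, ty)
            Jx = (-s * px - c * py, 1.0, 0.0)
            Jy = (c * px - s * py, 0.0, 1.0)
            for a in range(3):
                g[a] += w * (Jx[a] * rx + Jy[a] * ry)
                for b in range(3):
                    H[a][b] += w * (Jx[a] * Jx[b] + Jy[a] * Jy[b])
        d = solve3(H, [-v for v in g])     # Gauss-Newton step
        theta, tx, ty = theta + d[0], tx + d[1], ty + d[2]
    return theta, tx, ty

# Synthetic matches under a known pose; the last match is a gross outlier.
true_theta, true_t = 0.3, (1.0, -2.0)
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 0.5)]
ct, st = math.cos(true_theta), math.sin(true_theta)
dst = [(ct * x - st * y + true_t[0], st * x + ct * y + true_t[1]) for x, y in src]
dst[-1] = (10.0, 10.0)                     # corrupt the last match

conf = [1.0, 1.0, 1.0, 1.0, 0.0]           # stand-in for predicted confidences
pose_weighted = weighted_gauss_newton(src, dst, conf)
pose_uniform = weighted_gauss_newton(src, dst, [1.0] * len(src))
print("weighted:", pose_weighted)
print("uniform: ", pose_uniform)
```

With the outlier's confidence set to zero, the weighted solve recovers the synthetic pose, while uniform weights let the bad match bias the estimate. In an end-to-end setting the same update would be written in an autodiff framework so that gradients of the pose error can flow back into the confidences.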