Finding localized correspondences across different images of the same object is crucial to understanding its geometry. In recent years, this problem has seen remarkable progress with the advent of deep-learning-based local image features and learnable matchers. Still, learnable matchers often underperform when only small regions of co-visibility exist between image pairs (i.e., wide camera baselines). To address this problem, we leverage recent progress in coarse single-view geometry estimation. We propose LFM-3D, a Learnable Feature Matching framework that uses models based on graph neural networks and enhances their capabilities by integrating noisy, estimated 3D signals to boost correspondence estimation. When integrating 3D signals into the matcher model, we show that a suitable positional encoding is critical to effectively make use of the low-dimensional 3D information. We experiment with two different 3D signals, normalized object coordinates and monocular depth estimates, and evaluate our method on large-scale (synthetic and real) datasets containing object-centric image pairs across wide baselines. We observe strong feature matching improvements compared to 2D-only methods, with up to +6% total recall and +28% precision at fixed recall. We additionally demonstrate that the resulting improved correspondences lead to much higher relative pose accuracy for in-the-wild image pairs, with a boost of more than 8% compared to the 2D-only approach.
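To make the role of the positional encoding concrete, below is a minimal sketch of one common way to lift low-dimensional 3D signals into a higher-dimensional embedding before feeding them to a graph-neural-network matcher: a Fourier (sinusoidal) encoding applied per keypoint, with the result concatenated to the 2D visual descriptors. The helper name `fourier_encode`, the number of frequencies, the descriptor dimensionality, and the concatenation scheme are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def fourier_encode(x, num_freqs=8):
    """Map low-dimensional coordinates to a higher-dimensional embedding
    using sinusoids at geometrically spaced frequencies.

    x: (N, D) array of per-keypoint 3D signals, e.g. normalized object
       coordinates (D=3) or monocular depth (D=1), scaled to [0, 1].
    Returns: (N, D * 2 * num_freqs) embedding.
    """
    freqs = 2.0 ** np.arange(num_freqs)      # (F,) frequencies 1, 2, 4, ...
    angles = x[:, :, None] * freqs * np.pi   # (N, D, F) via broadcasting
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (N, D, 2F)
    return enc.reshape(x.shape[0], -1)

# Example (hypothetical values): encode normalized object coordinates for
# 4 keypoints and append the result to their visual descriptors, which
# would then be the per-node input to the GNN matcher.
rng = np.random.default_rng(0)
noc = rng.uniform(size=(4, 3))               # noisy 3D estimates in [0, 1]^3
desc = rng.normal(size=(4, 256))             # 2D local feature descriptors
augmented = np.concatenate([desc, fourier_encode(noc)], axis=1)
print(augmented.shape)                       # (4, 304) = (4, 256 + 3*2*8)
```

The intuition behind such an encoding is that raw 3-dimensional (or 1-dimensional, for depth) inputs carry too little capacity next to high-dimensional descriptors; spreading them across many sinusoidal frequencies lets the matcher exploit both coarse and fine variations in the noisy 3D signal.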