Dense geometric matching determines the dense pixel-wise correspondence between a source and a support image depicting the same 3D structure. Prior works employ an encoder of transformer blocks to correlate the features of the two frames. However, existing monocular pretraining tasks, e.g., image classification and masked image modeling (MIM), cannot pretrain this cross-frame module, yielding suboptimal performance. To resolve this, we reformulate MIM from reconstructing a single masked image to reconstructing a pair of masked images, enabling the pretraining of the cross-frame transformer module. Additionally, we incorporate a decoder into pretraining for improved upsampling results. Further, to be robust to textureless areas, we propose a novel cross-frame global matching module (CFGM). Since most textureless areas are planar surfaces, we propose a homography loss to further regularize learning in these regions. Combining these components, we achieve state-of-the-art (SoTA) performance on geometric matching. Codes and models are available at https://github.com/ShngJZ/PMatch.
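To make the paired-MIM reformulation concrete, below is a minimal PyTorch sketch, not PMatch's actual code: a high ratio of patches is masked in both frames, each frame is reconstructed from its own visible patches plus cross-frame evidence, and only the masked patches are supervised. All function and argument names here are illustrative assumptions, and `model` stands in for any cross-frame transformer that predicts per-patch pixels.

```python
import torch


def random_patch_mask(imgs, patch=16, ratio=0.75):
    """Per-sample random mask over the patch grid; True = masked."""
    b, _, h, w = imgs.shape
    n = (h // patch) * (w // patch)
    keep = int(n * (1.0 - ratio))
    ids = torch.rand(b, n, device=imgs.device).argsort(dim=1)
    mask = torch.ones(b, n, dtype=torch.bool, device=imgs.device)
    mask.scatter_(1, ids[:, :keep], False)  # leave a few patches visible
    return mask


def patchify(imgs, patch=16):
    """(B, 3, H, W) -> (B, N, patch*patch*3) per-patch pixel targets."""
    b, c, h, w = imgs.shape
    x = imgs.unfold(2, patch, patch).unfold(3, patch, patch)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)


def paired_mim_loss(model, img_a, img_b, patch=16):
    """Symmetric reconstruction loss over the masked patches of both frames."""
    mask_a = random_patch_mask(img_a, patch)
    mask_b = random_patch_mask(img_b, patch)
    # `model` is assumed to cross-attend between the two frames' visible
    # tokens and return per-patch pixel predictions for every patch.
    pred_a, pred_b = model(img_a, img_b, mask_a, mask_b)
    tgt_a, tgt_b = patchify(img_a, patch), patchify(img_b, patch)
    loss_a = ((pred_a - tgt_a) ** 2).mean(-1)[mask_a].mean()
    loss_b = ((pred_b - tgt_b) ** 2).mean(-1)[mask_b].mean()
    return loss_a + loss_b
```

Unlike single-image MIM, neither frame can be reconstructed well from its own sparse visible patches alone, so minimizing this loss pressures the encoder to exploit cross-frame correspondence, which is exactly the module that monocular pretraining leaves untrained.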
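As for the homography loss on planar regions, one plausible formulation (our notation; the paper's exact loss may differ) follows from the standard plane-induced homography. With intrinsics $\mathbf{K}, \mathbf{K}'$, relative pose $(\mathbf{R}, \mathbf{t})$, and a plane $\mathbf{n}^{\top}\mathbf{X} + d = 0$ in the source camera frame, correspondences on the plane satisfy a single homography, so predicted matches $\hat{\mathbf{x}}'_i$ over planar pixels $\mathcal{P}$ can be regularized as

\[
\mathbf{H} = \mathbf{K}' \left( \mathbf{R} - \frac{\mathbf{t}\,\mathbf{n}^{\top}}{d} \right) \mathbf{K}^{-1},
\qquad
\mathcal{L}_{\mathrm{hom}} = \frac{1}{|\mathcal{P}|} \sum_{i \in \mathcal{P}}
\left\lVert \hat{\mathbf{x}}'_i - \pi\!\left(\mathbf{H}\,\tilde{\mathbf{x}}_i\right) \right\rVert_1,
\]

where $\tilde{\mathbf{x}}_i$ is the source pixel in homogeneous coordinates and $\pi(\cdot)$ dehomogenizes. This ties all matches on a textureless plane to a low-dimensional model instead of letting each pixel drift independently.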