Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to such a strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer has only 45% GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state-of-the-art on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatches), and visual localization (InLoc).
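The core idea above is interleaving self-attention (feature extraction within one image) and cross-attention (feature matching across the two images) inside each encoder stage. A minimal NumPy sketch of that interleaving, assuming single-head scaled dot-product attention with residual connections; the token counts, dimensions, and block structure are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def attention(q, k, v):
    # Single-head scaled dot-product attention (no learned projections,
    # kept minimal for illustration).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def extract_and_match_stage(feat_a, feat_b, num_blocks=2):
    # Hypothetical encoder stage: alternate self-attention (extraction)
    # and cross-attention (matching) on the features of both views.
    for _ in range(num_blocks):
        # Self-attention: each view attends to itself.
        feat_a = feat_a + attention(feat_a, feat_a, feat_a)
        feat_b = feat_b + attention(feat_b, feat_b, feat_b)
        # Cross-attention: each view queries the other, so matching
        # signal enters the encoder instead of being left to a decoder.
        feat_a, feat_b = (feat_a + attention(feat_a, feat_b, feat_b),
                          feat_b + attention(feat_b, feat_a, feat_a))
    return feat_a, feat_b

# Toy flattened features: 16 tokens of dimension 32 per image.
rng = np.random.default_rng(0)
fa = rng.standard_normal((16, 32))
fb = rng.standard_normal((16, 32))
oa, ob = extract_and_match_stage(fa, fb)
print(oa.shape, ob.shape)  # shapes are preserved: (16, 32) (16, 32)
```

In the hierarchical model this stage would be repeated at multiple feature resolutions, with learned projections and multi-head attention; the sketch only shows the extract-and-match ordering.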