Local feature matching between images remains a challenging task, especially in the presence of significant appearance variations, e.g., extreme viewpoint changes. In this work, we propose DeepMatcher, a deep Transformer-based network built upon our investigation of local feature matching in detector-free methods. The key insight is that a local feature matcher with deep layers can capture features that are more human-intuitive and simpler to match. Based on this, we propose a Slimming Transformer (SlimFormer) dedicated to DeepMatcher, which leverages vector-based attention to model relevance among all keypoints and achieves efficient and effective long-range context aggregation. A relative position encoding is applied to each SlimFormer to explicitly expose relative distance information, further improving the representation of keypoints. A layer-scale strategy is also employed in each SlimFormer so that the network can adaptively assimilate the message exchange from the residual block, mimicking the way humans acquire different matching cues each time they scan an image pair. To facilitate a better adaptation of the SlimFormer, we introduce a Feature Transition Module (FTM) to ensure a smooth transition between feature scopes with different receptive fields. By interleaving the self- and cross-SlimFormer multiple times, DeepMatcher can easily establish pixel-wise dense matches at the coarse level. Finally, we treat match refinement as a combination of classification and regression problems and design a Fine Matches Module that predicts confidence and offset concurrently, thereby generating robust and accurate matches. Experimentally, we show that DeepMatcher significantly outperforms state-of-the-art methods on several benchmarks, demonstrating its superior matching capability.
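The abstract does not give the formula behind the layer-scale strategy. As a minimal sketch, assuming the commonly used formulation in which a residual update is multiplied by a learnable per-channel factor before being added back, the idea can be illustrated as follows. The names `layer_scale_residual`, `toy_block`, and `gamma` are hypothetical and are not taken from DeepMatcher; `toy_block` is only a stand-in for an attention block.

```python
from typing import Callable, List

Vector = List[float]


def layer_scale_residual(
    x: Vector, block: Callable[[Vector], Vector], gamma: Vector
) -> Vector:
    """Return x + gamma * block(x), applied elementwise per channel.

    Assumption: this mirrors the usual layer-scale residual form; the
    abstract itself does not specify the exact formulation.
    """
    update = block(x)
    if not (len(x) == len(update) == len(gamma)):
        raise ValueError("x, block(x) and gamma must have the same length")
    return [xi + gi * ui for xi, gi, ui in zip(x, gamma, update)]


def toy_block(x: Vector) -> Vector:
    # Hypothetical stand-in for a message-passing block:
    # pulls every channel towards the mean of the vector.
    mean = sum(x) / len(x)
    return [mean - xi for xi in x]


features = [1.0, 2.0, 3.0]

# With gamma = 0 the block's message is ignored and the input passes through.
unchanged = layer_scale_residual(features, toy_block, [0.0, 0.0, 0.0])

# With gamma = 0.5 half of the block's update is absorbed.
scaled = layer_scale_residual(features, toy_block, [0.5, 0.5, 0.5])
```

In a trained network `gamma` would be a learned parameter, so each layer can decide how much of the exchanged message to absorb; here it is fixed by hand purely for illustration.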