Cross-view matching is fundamentally achieved through cross-attention mechanisms. However, matching high-resolution images remains challenging due to the quadratic complexity and the lack of explicit matching constraints in existing cross-attention. This paper proposes MatchAttention, an attention mechanism that dynamically matches relative positions. Given a query, the relative position determines the attention sampling center of the key-value pairs. Continuous and differentiable sliding-window attention sampling is achieved by the proposed BilinearSoftmax. By embedding the relative positions into the feature channels, they are iteratively updated through residual connections across layers. Since the relative position is exactly the learning target of cross-view matching, an efficient hierarchical cross-view decoder, MatchDecoder, is designed with MatchAttention as its core component. To handle cross-view occlusions, gated cross-MatchAttention and a consistency-constrained loss are proposed. These two components jointly mitigate the impact of occlusions in both the forward and backward passes, allowing the model to focus on learning matching relationships. Applied to stereo matching, MatchStereo-B ranks first in average error on the public Middlebury benchmark and requires only 29 ms for KITTI-resolution inference; MatchStereo-T processes 4K UHD images in 0.1 s using only 3 GB of GPU memory. The proposed models also achieve state-of-the-art performance on the KITTI 2012, KITTI 2015, ETH3D, and Spring optical flow datasets. The combination of high accuracy and low computational complexity makes real-time, high-resolution, and high-accuracy cross-view matching practical. Project page: https://github.com/TingmanYan/MatchAttention.
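To make the sampling idea concrete, the following is a minimal pure-Python sketch, in 1-D, of attention over a sliding window of key-value pairs centered at a continuous relative position. Everything here is an illustrative assumption rather than the paper's actual BilinearSoftmax: the function name `match_attention_1d`, the window size, and the linear blending of two integer-shifted windows (the 1-D analogue of bilinear weighting) are hypothetical; the paper operates on 2-D feature maps and integrates the interpolation with the softmax.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def match_attention_1d(q, K, V, rel_pos, window=3):
    """Hypothetical 1-D sketch of window attention at a continuous center.

    q       : query vector (list of length d)
    K, V    : keys/values along the matching dimension (lists of length-d lists)
    rel_pos : continuous relative position predicted for this query; the
              fractional part linearly blends the two integer-shifted
              windows, keeping the sampling differentiable w.r.t. rel_pos
              (the 1-D analogue of bilinear weights).
    """
    d, N = len(q), len(K)
    lo = math.floor(rel_pos)
    frac = rel_pos - lo
    half = window // 2
    # window + 1 integer taps so two shifted windows can be blended;
    # indices are clamped to the valid range at the borders
    idx = [min(max(lo + i, 0), N - 1) for i in range(-half, half + 2)]

    dot = lambda a, b: sum(x * y for x, y in zip(a, b)) / math.sqrt(d)
    scores = [dot(K[i], q) for i in idx]

    # linearly blend the scores of the two integer-shifted windows,
    # then apply a single softmax over the blended window
    blended = [(1 - frac) * scores[i] + frac * scores[i + 1] for i in range(window)]
    attn = softmax(blended)

    # aggregate values interpolated at the same fractional positions
    out = [0.0] * d
    for w, i_lo, i_hi in zip(attn, idx[:-1], idx[1:]):
        for c in range(d):
            out[c] += w * ((1 - frac) * V[i_lo][c] + frac * V[i_hi][c])
    return out, attn
```

Because the window has a small fixed size, the cost per query is constant, which is the sketch's analogue of the linear (rather than quadratic) complexity the abstract targets; the returned `attn` weights always sum to one over the window.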