Generating robust and reliable correspondences across images is a fundamental task for a variety of applications. To capture context at both global and local granularity, we propose ASpanFormer, a Transformer-based detector-free matcher built on a hierarchical attention structure, which adopts a novel attention operation capable of adjusting the attention span in a self-adaptive manner. To achieve this goal, flow maps are first regressed in each cross-attention phase to locate the center of the search region. Next, a sampling grid is generated around the center; its size, instead of being empirically fixed, is computed adaptively from a pixel uncertainty estimated along with the flow map. Finally, attention is computed across the two images within the derived regions, referred to as attention spans. In this way, we not only maintain long-range dependencies but also enable fine-grained attention among pixels of high relevance, which accommodates the essential locality and piece-wise smoothness of matching tasks. State-of-the-art accuracy on a wide range of evaluation benchmarks validates the strong matching capability of our method.
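The core idea above — a cross-attention whose search window is centered on a regressed flow vector and sized by the estimated pixel uncertainty — can be sketched for a single query pixel as follows. This is a minimal illustration, not the paper's implementation: the scale factor `k`, the hard window clipping, and the single-pixel formulation are all simplifying assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_span_attention(query, feat_b, center, sigma, k=2.0):
    """One adaptive-span cross-attention step for a single query pixel.

    query  : (C,) feature vector of a pixel in image A.
    feat_b : (H, W, C) feature map of image B.
    center : (y, x) match location predicted by the regressed flow map.
    sigma  : pixel uncertainty estimated alongside the flow; the span
             radius grows with sigma instead of being fixed (k is an
             assumed scale factor, hypothetical here).
    """
    H, W, C = feat_b.shape
    r = max(1, int(np.ceil(k * sigma)))            # adaptive span radius
    cy, cx = int(round(center[0])), int(round(center[1]))
    y0, y1 = max(0, cy - r), min(H, cy + r + 1)    # clip grid to image
    x0, x1 = max(0, cx - r), min(W, cx + r + 1)
    keys = feat_b[y0:y1, x0:x1].reshape(-1, C)     # keys inside the span
    scores = keys @ query / np.sqrt(C)             # scaled dot product
    w = softmax(scores)                            # attention weights
    return w @ keys                                # attended message
```

A small uncertainty yields a tight window (fine-grained, local attention), while a large uncertainty widens the window toward a global search, matching the abstract's goal of covering both granularities with one operation.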