We present a novel architecture for dense correspondence. Current state-of-the-art methods are Transformer-based approaches that focus on either feature descriptors or cost volume aggregation. However, they generally aggregate one or the other but not both, even though joint aggregation would let each benefit from information the other lacks, i.e., the structural or semantic information of an image, or the pixel-wise matching similarity. In this work, we propose a novel Transformer-based network that interleaves both forms of aggregation in a way that exploits their complementary information. Specifically, we design a self-attention layer that leverages the descriptor to disambiguate the noisy cost volume and that also utilizes the cost volume to aggregate features in a manner that promotes accurate matching. A subsequent cross-attention layer performs further aggregation conditioned on the descriptors of both images and aided by the aggregated outputs of earlier layers. We further boost performance with hierarchical processing, in which coarser-level aggregations guide those at finer levels. We evaluate the effectiveness of the proposed method on dense matching tasks and achieve state-of-the-art performance on all the major benchmarks. Extensive ablation studies are also provided to validate our design choices.
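The joint aggregation idea above can be sketched in a minimal NumPy toy: a cost volume is computed as the pairwise similarity between the descriptors of two images, and a single set of self-attention weights, derived from the descriptors, aggregates the descriptors and smooths the noisy cost volume at the same time. All names (`joint_self_attention`, `feat_a`, `feat_b`) are hypothetical; this is an illustrative sketch, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(feat_a, feat_b):
    """Toy interleaved aggregation (hypothetical, for illustration only):
    descriptors of image A attend over themselves, and the same attention
    weights are reused to aggregate the noisy cost volume against image B,
    so the descriptor stream guides the cost-volume stream."""
    n, d = feat_a.shape
    # Pixel-wise matching similarity between the two images (the cost volume).
    cost = feat_a @ feat_b.T / np.sqrt(d)
    # Self-attention weights computed from the descriptors of image A.
    attn = softmax(feat_a @ feat_a.T / np.sqrt(d), axis=-1)
    # The same weights aggregate both descriptors and matching costs,
    # so structural/semantic cues and matching similarity inform each other.
    feat_agg = attn @ feat_a   # (n, d) aggregated descriptors
    cost_agg = attn @ cost     # (n, m) descriptor-guided cost volume
    return feat_agg, cost_agg

# Usage: 5 pixels in image A, 6 in image B, 8-dim descriptors.
rng = np.random.default_rng(0)
fa, fb = rng.normal(size=(5, 8)), rng.normal(size=(6, 8))
feat_agg, cost_agg = joint_self_attention(fa, fb)
```

A real model would of course use learned query/key/value projections and stack such layers hierarchically; the sketch only shows how one attention map can aggregate both representations.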