We propose a novel cost aggregation network, called Cost Aggregation with Transformers (CATs), to find dense correspondences between semantically similar images under the additional challenges posed by large intra-class appearance and geometric variations. Previous hand-crafted or CNN-based methods for the cost aggregation stage either lack robustness to severe deformations or inherit the limitation of CNNs, whose limited receptive fields prevent them from discriminating incorrect matches. In contrast, CATs explore global consensus among initial correlation maps with the help of architectural designs that allow us to exploit the full potential of the self-attention mechanism. Specifically, we include appearance affinity modelling to disambiguate the initial correlation maps and multi-level aggregation to benefit from hierarchical feature representations within a Transformer-based aggregator, and we combine these with swapping self-attention and residual connections, not only to enforce consistent matching but also to ease the learning process. We conduct experiments demonstrating the effectiveness of the proposed model over the latest methods and provide extensive ablation studies. Code and trained models will be made available at https://github.com/SunghwanHong/CATs.
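The core mechanism described above, self-attention applied to a raw correlation map, a swap (transpose) so that both matching directions are treated symmetrically, and residual connections, can be sketched as follows. This is a minimal NumPy toy under our own assumptions, not the authors' implementation: all names and shapes are illustrative, a single attention head with learned square projections stands in for the full Transformer aggregator, and appearance affinity modelling and multi-level aggregation are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax over attention scores
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_block(tokens, Wq, Wk, Wv):
    """One single-head self-attention layer with a residual connection.

    Each row of `tokens` is one position's vector of matching scores;
    the residual lets the layer refine, rather than replace, the scores.
    """
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return tokens + attn @ v

def aggregate(corr, Wq, Wk, Wv):
    """Refine a correlation map, then swap (transpose) it and refine again,
    so source->target and target->source matching are both made consistent."""
    refined = attention_block(corr, Wq, Wk, Wv)       # rows: source positions
    refined = attention_block(refined.T, Wq, Wk, Wv)  # rows: target positions
    return refined.T

N = 16                              # number of source/target positions (kept equal for brevity)
corr = rng.standard_normal((N, N))  # raw (noisy) correlation map
Wq, Wk, Wv = (0.1 * rng.standard_normal((N, N)) for _ in range(3))

out = aggregate(corr, Wq, Wk, Wv)
print(out.shape)  # (16, 16): refined correlation map, same shape as the input
```

In the actual model the aggregator operates on multi-level correlation maps augmented with projected appearance features, but the swap-and-refine pattern above is the consistency mechanism the abstract refers to.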