Homography estimation is a basic computer vision task that aims to recover the transformation between multi-view images for image alignment. Unsupervised homography estimation trains a convolutional neural network (CNN) for feature extraction and transformation-matrix regression. While state-of-the-art homography methods are based on CNNs, little work has focused on transformers, which have shown superiority in high-level vision tasks. In this paper, we propose a strong baseline model based on the Swin Transformer, which combines a CNN for local features with a transformer module for global features. Moreover, a cross non-local layer is introduced to coarsely search for matched features within the feature maps. In the homography regression stage, we adopt an attention layer over the channels of the correlation volume, which can filter out weakly correlated feature points. Experiments show that for 8-degree-of-freedom (DOF) homography estimation, our method outperforms the state-of-the-art method.
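To make the 8-DOF claim concrete: a homography is a 3x3 matrix defined up to scale, so fixing the bottom-right entry to 1 leaves eight free parameters, which four point correspondences determine exactly. The sketch below is only an illustration of that counting argument, not the regression network described above. The function names (`solve_linear`, `homography_from_points`, `warp_point`) and the sample corner coordinates are hypothetical, and the direct linear solve is an assumed stand-in for whatever the model's regression head produces.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_points(src, dst):
    """Return the 3x3 homography mapping four src points onto four dst points.

    Assumes h33 = 1, which leaves exactly 8 unknowns (the 8 DOFs).
    """
    if len(src) != 4 or len(dst) != 4:
        raise ValueError("exactly four correspondences are required")
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), rearranged to be linear.
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        # v = (h21 x + h22 y + h23) / (h31 x + h32 y + 1), rearranged likewise.
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(h, point):
    """Apply homography h to a 2D point, including the projective division."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return (u, v)


# Hypothetical example: corners of a 128x128 patch and their displaced positions.
src_corners = [(0.0, 0.0), (127.0, 0.0), (127.0, 127.0), (0.0, 127.0)]
dst_corners = [(5.0, -3.0), (120.0, 8.0), (131.0, 122.0), (-6.0, 130.0)]
H = homography_from_points(src_corners, dst_corners)
```

Describing the homography by the displacement of four corners is a commonly used parameterization in learned homography estimation; whether the model above regresses corner offsets or the matrix entries directly is not stated in this abstract.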