Homography estimation is a basic computer vision task that aims to recover the transformation between multi-view images for image alignment. Unsupervised homography estimation trains a convolutional neural network for feature extraction and transformation matrix regression. While state-of-the-art homography methods are based on convolutional neural networks, few works focus on transformers, which have shown superiority in high-level vision tasks. In this paper, we propose a strong baseline model based on the Swin Transformer, which combines a convolutional neural network for local features with a transformer module for global features. Moreover, a cross non-local layer is introduced to coarsely search for matched features within the feature maps. In the homography regression stage, we adopt an attention layer over the channels of the correlation volume, which can drop some weakly correlated feature points. Experiments show that our method outperforms the state-of-the-art method in 8-degree-of-freedom (8-DOF) homography estimation.
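To make the 8-DOF setting concrete, the sketch below recovers a homography from four point correspondences by solving the standard 8x8 linear system with the bottom-right entry fixed to 1. It is only an illustration of what the eight degrees of freedom are, not the network proposed in this paper; the function names (`solve_linear`, `homography_from_4_points`, `warp_point`) and the example coordinates are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] without mutating the inputs.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography_from_4_points(src, dst):
    """Return the 3x3 matrix H (with H[2][2] = 1) mapping src[i] to dst[i].

    Each correspondence (x, y) -> (u, v) contributes two linear equations
    in the eight unknown entries of H.
    """
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(h, point):
    """Apply H to a 2D point using homogeneous coordinates."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return (u, v)


if __name__ == "__main__":
    # Corners of a 128x128 patch and hypothetical perturbed corners.
    src = [(0, 0), (127, 0), (127, 127), (0, 127)]
    dst = [(3, -2), (130, 4), (121, 133), (-5, 120)]
    H = homography_from_4_points(src, dst)
    for p in src:
        print(p, "->", warp_point(H, p))
```

Four corner displacements (eight scalars) determine the matrix, which is why learning-based methods commonly regress corner offsets rather than the matrix entries directly; whether the model described here does so is not stated in the abstract.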