In image fusion, images obtained from different sensors are fused to generate a single image with enhanced information. In recent years, state-of-the-art methods have adopted Convolutional Neural Networks (CNNs) to encode meaningful features for image fusion. Specifically, CNN-based methods perform image fusion by fusing local features. However, they do not consider the long-range dependencies that are present in the image. Transformer-based models are designed to overcome this limitation by modeling long-range dependencies with a self-attention mechanism. This motivates us to propose a novel Image Fusion Transformer (IFT), in which we develop a transformer-based multi-scale fusion strategy that attends to both local and long-range information (or global context). The proposed method follows a two-stage training approach. In the first stage, we train an auto-encoder to extract deep features at multiple scales. In the second stage, the multi-scale features are fused using a Spatio-Transformer (ST) fusion strategy. Each ST fusion block comprises a CNN branch and a transformer branch, which capture local and long-range features, respectively. Extensive experiments on multiple benchmark datasets show that the proposed method performs better than many competitive fusion algorithms. Furthermore, we show the effectiveness of the proposed ST fusion strategy with an ablation analysis. The source code is available at: https://github.com/Vibashan/Image-Fusion-Transformer.
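To make the distinction between the two branches concrete, the following is a minimal, dependency-free sketch and not the released implementation. Everything in it is an illustrative assumption: the function names (`local_branch`, `global_branch`, `st_fuse`), the use of a fixed 3x3 mean filter in place of learned convolutions, single-head self-attention on scalar features without learned projections, averaging the two source feature maps before the branches, and summing the branch outputs.

```python
import math


def local_branch(feat):
    """Stand-in for the CNN branch: 3x3 mean filter with zero padding.

    Each output value depends only on its immediate neighbourhood.
    """
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        total += feat[y][x]
            out[i][j] = total / 9.0
    return out


def global_branch(feat):
    """Stand-in for the transformer branch: single-head self-attention.

    Every position attends to every other position. Assumption: query,
    key and value are the raw scalar feature (no learned projections).
    """
    h, w = len(feat), len(feat[0])
    tokens = [v for row in feat for v in row]
    out = []
    for q in tokens:
        scores = [q * k for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(sum(wt * v for wt, v in zip(weights, tokens)) / z)
    return [out[r * w:(r + 1) * w] for r in range(h)]


def st_fuse(feat_a, feat_b):
    """Toy fusion of two same-sized feature maps (hypothetical design).

    Assumption: the sources are averaged, then the local and global
    responses are summed.
    """
    h, w = len(feat_a), len(feat_a[0])
    merged = [[(feat_a[i][j] + feat_b[i][j]) / 2.0 for j in range(w)]
              for i in range(h)]
    loc = local_branch(merged)
    glo = global_branch(merged)
    return [[loc[i][j] + glo[i][j] for j in range(w)] for i in range(h)]
```

In this sketch, a single bright value in one corner of a feature map leaves the `local_branch` output at the opposite corner unchanged, while the `global_branch` output there shifts, which is the long-range behaviour the transformer branch is meant to contribute. In a multi-scale setting, a function like `st_fuse` would be applied once per scale of the encoder features.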