Fusing images captured by heterogeneous sensors enriches the available information and improves imaging quality. In this article, we present a hybrid model, consisting of a convolutional encoder and a Transformer-based decoder, for fusing multimodal images. In the encoder, a non-local cross-modal attention block is proposed to capture both local and global dependencies across the source images, and a branch fusion module is designed to adaptively fuse the features of the two branches. In the decoder, we embed a Transformer module with linear complexity to enhance the reconstruction capability of the proposed network. Qualitative and quantitative comparisons with existing state-of-the-art fusion models demonstrate the effectiveness of the proposed method. The source code of our work is available at https://github.com/pandayuanyu/HCFusion.
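The abstract mentions a Transformer module with linear complexity in the decoder. As a rough illustration of how attention can be made linear in the number of spatial positions, the sketch below implements one common formulation (so-called efficient attention, which normalizes queries and keys separately and computes the key-value context first). This is only an assumed, minimal NumPy sketch of the general idea, not the paper's actual module; the function and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linear_attention(Q, K, V):
    """Attention with O(n) cost in the sequence length n.

    Instead of forming the n x n map softmax(QK^T)V, queries are
    normalized over channels and keys over positions, and the d x d
    context K^T V is computed first (d = channel dimension).
    Illustrative sketch only, not the paper's exact decoder block.
    """
    q = softmax(Q, axis=-1)   # (n, d): each query normalized over channels
    k = softmax(K, axis=0)    # (n, d): each key channel normalized over positions
    context = k.T @ V         # (d, d): global context, size independent of n
    return q @ context        # (n, d): per-position output features

rng = np.random.default_rng(0)
n, d = 64, 8                  # e.g. 64 flattened spatial positions, 8 channels
Q, K, V = rng.standard_normal((3, n, d))
out = linear_attention(Q, K, V)
```

Because the intermediate `context` matrix is d x d rather than n x n, memory and compute grow linearly with the number of image positions, which is what makes such modules practical inside a fusion decoder operating on full-resolution feature maps.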