The Transformer architecture has witnessed rapid development in recent years, outperforming CNN architectures in many computer vision tasks, as exemplified by the Vision Transformer (ViT) for image classification. However, existing visual transformer models aim to extract semantic information for high-level tasks such as classification and detection. These methods ignore the importance of the spatial resolution of the input image and thus sacrifice the local correlation information of neighboring pixels. In this paper, we propose a Patch Pyramid Transformer (PPT) to effectively address these issues. Specifically, we first design a Patch Transformer that splits the image into a sequence of patches, where transformer encoding is performed within each patch to extract local representations. We then construct a Pyramid Transformer to effectively extract non-local information from the entire image. After obtaining a set of multi-scale, multi-dimensional, and multi-angle features of the original image, we design an image reconstruction network to ensure that the features can be reconstructed into the original input. To validate its effectiveness, we apply the proposed Patch Pyramid Transformer to image fusion tasks. The experimental results demonstrate its superior performance compared to state-of-the-art fusion approaches, achieving the best results on several evaluation metrics. Thanks to the underlying representational capacity of the PPT network, it can be applied directly to different image fusion tasks without redesigning or retraining the network.
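To make the two encoding stages concrete, the following is a minimal PyTorch sketch: a patch encoder that attends over the pixels inside each patch (preserving local correlation), and a pyramid that applies it to progressively downsampled copies of the image for non-local, multi-scale features. All module names, hyperparameters, and the pooling-based pyramid construction are illustrative assumptions rather than the paper's implementation; the reconstruction network is omitted.

```python
# Sketch of patch-wise transformer encoding plus a multi-scale pyramid.
# Module names, dimensions, and the average-pooled pyramid are assumptions
# made for illustration, not the authors' actual PPT implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchEncoder(nn.Module):
    """Split an image into non-overlapping patches and run a small
    transformer encoder over the pixels inside each patch, so the
    local correlation of neighboring pixels is preserved."""

    def __init__(self, patch_size=8, channels=1, dim=64, depth=2, heads=4):
        super().__init__()
        self.patch_size = patch_size
        self.embed = nn.Linear(channels, dim)  # per-pixel embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        b, c, h, w = x.shape
        p = self.patch_size
        # (B, C, H, W) -> (B * num_patches, p*p, C): one token per pixel
        patches = F.unfold(x, kernel_size=p, stride=p)      # (B, C*p*p, N)
        n = patches.shape[-1]
        patches = patches.transpose(1, 2).reshape(b * n, c, p * p)
        tokens = self.embed(patches.transpose(1, 2))        # (B*N, p*p, dim)
        return self.encoder(tokens)                         # local features


class PyramidEncoder(nn.Module):
    """Apply the same patch encoder to progressively downsampled copies
    of the image, yielding multi-scale (non-local) feature sets."""

    def __init__(self, levels=3, **kw):
        super().__init__()
        self.levels = levels
        self.patch_enc = PatchEncoder(**kw)

    def forward(self, x):
        feats = []
        for _ in range(self.levels):
            feats.append(self.patch_enc(x))
            x = F.avg_pool2d(x, kernel_size=2)  # next pyramid level
        return feats


if __name__ == "__main__":
    img = torch.randn(1, 1, 64, 64)  # dummy grayscale image
    for i, f in enumerate(PyramidEncoder(levels=3)(img)):
        print(f"level {i}: {tuple(f.shape)}")
```

Because the encoder is shared across pyramid levels, each level produces features in the same token space; a fusion rule can then combine features from two source images level by level before reconstruction.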