TFormer: A throughout fusion transformer for multi-modal skin lesion diagnosis

Multi-modal skin lesion diagnosis (MSLD) has achieved remarkable success through modern computer-aided diagnosis technology based on deep convolutions. However, aggregating information across modalities in MSLD remains challenging because of severely unaligned spatial resolutions (dermoscopic image and clinical image) and heterogeneous data (dermoscopic image and patients' meta-data). Limited by their intrinsically local attention, most recent MSLD pipelines built on pure convolutions struggle to capture representative features in shallow layers, so fusion across modalities is usually performed at the end of the pipeline, sometimes only at the last layer, which leads to insufficient information aggregation. To tackle this issue, we introduce a pure transformer-based method, which we refer to as "Throughout Fusion Transformer (TFormer)", for sufficient information integration in MSLD. Unlike existing convolution-based approaches, the proposed network uses a transformer as the feature extraction backbone, yielding more representative shallow features. We then carefully design a stack of dual-branch hierarchical multi-modal transformer (HMT) blocks to fuse information across the image modalities stage by stage. With the aggregated information of the image modalities, a multi-modal transformer post-fusion (MTP) block is designed to integrate features across image and non-image data. This strategy of first fusing the image modalities and then the heterogeneous data lets us divide and conquer the two major challenges while ensuring that inter-modality dynamics are effectively modeled. Experiments on the public Derm7pt dataset validate the superiority of the proposed method: TFormer outperforms other state-of-the-art methods, and ablation experiments support the effectiveness of our designs.
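
The ordering described in the abstract (fuse the two image modalities stage by stage, then fuse the result with the metadata) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names (`hmt_block`, `mtp_block`, `tformer_sketch`), the projection-free single-head attention, the residual additions, the mean pooling, and the number of stages are all assumptions chosen for illustration, and the sketch omits the transformer backbone, learned weights, and hierarchical downsampling.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention with no learned projections.

    Every query token attends over all tokens of the other modality, so the
    two token sets may have different lengths (unaligned resolutions).
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, keys_values))
                    for i in range(dim)])
    return out


def add(tokens_a, tokens_b):
    """Element-wise residual addition of two equally shaped token lists."""
    return [[x + y for x, y in zip(a, b)] for a, b in zip(tokens_a, tokens_b)]


def hmt_block(derm, clin):
    """Hypothetical dual-branch fusion: each image branch queries the other."""
    derm_out = add(derm, cross_attention(derm, clin))
    clin_out = add(clin, cross_attention(clin, derm))
    return derm_out, clin_out


def mean_pool(tokens):
    """Average a token list into a single feature vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def mtp_block(derm, clin, meta):
    """Hypothetical post-fusion: a metadata token attends over pooled image features."""
    image_tokens = [mean_pool(derm), mean_pool(clin)]
    return add([meta], cross_attention([meta], image_tokens))[0]


def tformer_sketch(derm, clin, meta, num_stages=3):
    """Fuse image modalities at every stage, then fuse with metadata once."""
    for _ in range(num_stages):
        derm, clin = hmt_block(derm, clin)
    return mtp_block(derm, clin, meta)


# Toy inputs: 4 dermoscopic tokens, 2 clinical tokens, one metadata vector.
derm_tokens = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5], [0.4, 0.4]]
clin_tokens = [[0.2, 0.2], [0.6, 0.1]]
meta_vector = [1.0, 0.0]
fused = tformer_sketch(derm_tokens, clin_tokens, meta_vector)
```

The point of the sketch is the control flow: image-to-image fusion happens inside every stage rather than only at the end, and the heterogeneous metadata enters once, after the image features have already been aggregated.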
