Vision Transformers (ViTs) have achieved remarkable success in standard RGB image processing tasks. However, applying ViTs to multi-channel imaging (MCI) data, e.g., for medical and remote sensing applications, remains a challenge. In particular, MCI data often consist of layers acquired from different modalities. Directly training ViTs on such data can obscure complementary information and impair performance. In this paper, we introduce a simple yet effective pretraining framework for large-scale MCI datasets. Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually, thereby enabling pretraining for multimodal multi-channel tasks. We show that this channel-wise patchifying is a key technique for MCI processing. More importantly, one can pretrain the IC-ViT on single channels and finetune it on downstream multi-channel datasets. This pretraining framework captures dependencies between patches as well as channels and produces robust feature representations. Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging, and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers a 4-14 percentage point performance improvement over existing channel-adaptive approaches. Moreover, its efficient training makes it a suitable candidate for large-scale pretraining of foundation models on heterogeneous data. Our code is available at https://github.com/shermanlian/IC-ViT.
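To make the channel-wise patchifying concrete, below is a minimal PyTorch-style sketch, not the paper's actual implementation: the class name ChannelWisePatchEmbed and all hyperparameters are illustrative assumptions. The idea is that each channel is patchified independently through a shared single-channel projection, so a C-channel image produces C times as many tokens as a standard ViT embedding, and the same weights apply whether the input has one channel (pretraining) or many (finetuning).

```python
import torch
import torch.nn as nn


class ChannelWisePatchEmbed(nn.Module):
    """Patchify every channel independently (hypothetical sketch).

    A (B, C, H, W) image yields C * (H/P) * (W/P) tokens instead of the
    (H/P) * (W/P) tokens of a standard ViT patch embedding.
    """

    def __init__(self, patch_size=16, embed_dim=768):
        super().__init__()
        # One single-channel projection shared across all channels, so the
        # same weights serve single-channel pretraining and
        # multi-channel finetuning.
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size,
                              stride=patch_size)

    def forward(self, x):
        b, c, h, w = x.shape
        # Fold channels into the batch dimension so each channel is
        # patchified in isolation.
        tokens = self.proj(x.reshape(b * c, 1, h, w))  # (B*C, D, H/P, W/P)
        tokens = tokens.flatten(2).transpose(1, 2)     # (B*C, N, D)
        n = tokens.shape[1]
        return tokens.reshape(b, c * n, -1)            # (B, C*N, D)


# Example: a 5-channel 224x224 image gives 5 * 14 * 14 = 980 tokens.
emb = ChannelWisePatchEmbed()
out = emb(torch.randn(2, 5, 224, 224))
print(out.shape)  # torch.Size([2, 980, 768])
```

Folding channels into the batch dimension keeps the projection agnostic to the number of input channels, which is what allows the subsequent transformer to attend across both patches and channels of heterogeneous, multimodal inputs.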