Current transformer-based change detection (CD) approaches either employ a model pre-trained on the large-scale ImageNet classification dataset or rely on first pre-training on another CD dataset and then fine-tuning on the target benchmark. This strategy is driven by the fact that transformers typically require large amounts of training data to learn inductive biases, which standard CD datasets cannot provide due to their small size. We develop an end-to-end CD approach with transformers that is trained from scratch and yet achieves state-of-the-art performance on four public benchmarks. Instead of conventional self-attention, which struggles to capture inductive biases when trained from scratch, our architecture utilizes a shuffled sparse-attention operation that focuses on selected sparse informative regions to capture the inherent characteristics of the CD data. Moreover, we introduce a change-enhanced feature fusion (CEFF) module that fuses the features of the input image pair via per-channel re-weighting, enhancing the relevant semantic changes while suppressing noisy ones. Extensive experiments on four CD datasets reveal the merits of the proposed contributions, achieving gains of up to 14.27\% in intersection-over-union (IoU) score over the best published results in the literature. Code is available at \url{https://github.com/mustansarfiaz/ScratchFormer}.
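The per-channel re-weighted fusion described above can be sketched as follows. This is a hypothetical illustration, not the authors' released implementation: the gating design (squeeze-and-excitation-style pooling over the feature difference) and the class name `ChannelReweightFusion` are assumptions made for clarity.

```python
# Hypothetical sketch of a CEFF-like per-channel re-weighted fusion.
# Given features f1, f2 of a bi-temporal image pair from a shared encoder,
# channel weights are derived from their absolute difference, so channels
# carrying relevant change are emphasized and noisy ones are suppressed.
import torch
import torch.nn as nn


class ChannelReweightFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Squeeze-and-excitation-style gating over the difference signal.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # global per-channel statistics
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                 # per-channel weights in (0, 1)
        )

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        diff = torch.abs(f1 - f2)        # change-sensitive signal
        w = self.gate(diff)              # (B, C, 1, 1) channel weights
        return (f1 + f2) * w + diff      # re-weighted fused features


# Usage: fuse a pair of 64-channel feature maps.
f1 = torch.randn(2, 64, 32, 32)
f2 = torch.randn(2, 64, 32, 32)
fused = ChannelReweightFusion(64)(f1, f2)
print(fused.shape)  # torch.Size([2, 64, 32, 32])
```

The difference-driven gate is one simple way to realize "per-channel re-weighting"; the paper's actual module may use a different weighting signal or normalization.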