In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT). TSViT splits a SITS record into non-overlapping patches in space and time, which are tokenized and subsequently processed by a factorized temporo-spatial encoder. We argue that, in contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing and present experimental evidence for this claim. Additionally, we enhance the model's discriminative power by introducing two novel mechanisms: acquisition-time-specific temporal positional encodings and multiple learnable class tokens. The effect of all novel design choices is evaluated through an extensive ablation study. Our proposed architecture achieves state-of-the-art performance, surpassing previous approaches by a significant margin on three publicly available SITS semantic segmentation and classification datasets. All model, training, and evaluation code is made publicly available to facilitate further research.
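To make the three mechanisms concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of the temporal-then-spatial factorization, with a date-indexed lookup table standing in for the acquisition-time-specific temporal positional encodings and K learnable class tokens prepended to each temporal sequence. All module names, dimensions, and the day-of-year indexing scheme are illustrative assumptions.

```python
# Sketch of TSViT-style temporal-then-spatial encoding; shapes and
# hyperparameters are hypothetical. Input x: (B, T, C, H, W) image time
# series; doy: (B, T) acquisition day-of-year indices per frame.
import torch
import torch.nn as nn

class TSViTSketch(nn.Module):
    def __init__(self, num_classes=10, in_channels=4, patch=8,
                 dim=128, depth=2, heads=4, max_doy=366):
        super().__init__()
        self.K = num_classes
        # Tokenize each frame into non-overlapping spatial patches.
        self.to_tokens = nn.Conv2d(in_channels, dim, patch, stride=patch)
        # Acquisition-time-specific temporal positional encodings:
        # a table indexed by each frame's acquisition date.
        self.temporal_pe = nn.Embedding(max_doy, dim)
        # Multiple learnable class tokens, one per semantic class.
        self.cls_tokens = nn.Parameter(torch.randn(num_classes, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.temporal_enc = nn.TransformerEncoder(layer, depth)  # layers are deep-copied
        self.spatial_enc = nn.TransformerEncoder(layer, depth)

    def forward(self, x, doy):
        B, T, C, H, W = x.shape
        tok = self.to_tokens(x.flatten(0, 1))             # (B*T, D, h, w)
        N = tok.shape[-2] * tok.shape[-1]                 # patches per frame
        tok = tok.flatten(2).transpose(1, 2).view(B, T, N, -1)
        # Add date-dependent temporal encodings, shared across space.
        tok = tok + self.temporal_pe(doy)[:, :, None, :]
        # Temporal stage: attend over time within each patch location,
        # prepending the K class tokens to every temporal sequence.
        seq = tok.permute(0, 2, 1, 3).reshape(B * N, T, -1)
        cls = self.cls_tokens.expand(B * N, -1, -1)       # (B*N, K, D)
        out = self.temporal_enc(torch.cat([cls, seq], dim=1))[:, :self.K]
        # Spatial stage: attend over patch locations, per class token.
        out = out.view(B, N, self.K, -1).permute(0, 2, 1, 3)
        out = self.spatial_enc(out.reshape(B * self.K, N, -1))
        return out.view(B, self.K, N, -1)                 # per-class features

# Quick shape check with random data.
model = TSViTSketch()
x = torch.randn(2, 6, 4, 32, 32)                          # B=2, T=6 frames
doy = torch.randint(0, 366, (2, 6))                       # acquisition dates
print(model(x, doy).shape)                                # (2, 10, 16, 128)
```

Note how the factorization order shapes the computation: each patch location is first summarized over time (where most of the class-discriminative signal in SITS lives), and only the resulting per-class tokens are then mixed spatially, rather than the spatial-then-temporal order common for natural video.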