In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT). TSViT splits a SITS record into non-overlapping patches in space and time, which are tokenized and subsequently processed by a factorized temporo-spatial encoder. We argue that, in contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing, and present experimental evidence for this claim. Additionally, we enhance the model's discriminative power by introducing two novel mechanisms: acquisition-time-specific temporal positional encodings and multiple learnable class tokens. The effect of all novel design choices is evaluated through an extensive ablation study. Our proposed architecture achieves state-of-the-art performance, surpassing previous approaches by a significant margin on three publicly available SITS semantic segmentation and classification datasets. All model, training, and evaluation code is made publicly available to facilitate further research.
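To make the temporal-then-spatial factorization and the acquisition-time-specific positional encoding concrete, here is a minimal sketch assuming PyTorch. All shapes, layer counts, and names (`FactorizedTemporoSpatialEncoder`, `date_pos`, `doy`) are illustrative assumptions, not the authors' implementation, and the sketch omits the learnable class tokens for brevity.

```python
import torch
import torch.nn as nn

class FactorizedTemporoSpatialEncoder(nn.Module):
    """Sketch of a temporal-then-spatial factorized Transformer encoder.

    Input tokens have shape (B, T, N, D): batch, acquisition times,
    spatial patches, embedding dim. Hyperparameters are placeholders.
    """

    def __init__(self, dim=128, depth_t=4, depth_s=4, heads=4):
        super().__init__()
        def layer():
            return nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer(), num_layers=depth_t)
        self.spatial = nn.TransformerEncoder(layer(), num_layers=depth_s)
        # Temporal position encoding looked up by acquisition day of
        # year, so the encoding depends on *when* each image was taken,
        # not on its index in the sequence.
        self.date_pos = nn.Embedding(366, dim)

    def forward(self, x, doy):
        # x: (B, T, N, D); doy: (B, T) integer day-of-year per step.
        B, T, N, D = x.shape
        x = x + self.date_pos(doy)[:, :, None, :]  # broadcast over patches
        # Temporal stage: attend across times, one sequence per patch.
        x = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        x = self.temporal(x)
        # Spatial stage: attend across patches, one sequence per time.
        x = x.reshape(B, N, T, D).permute(0, 2, 1, 3).reshape(B * T, N, D)
        x = self.spatial(x)
        return x.reshape(B, T, N, D)

# Usage with toy dimensions:
x = torch.randn(2, 10, 64, 128)        # B=2, T=10, N=64 patches, D=128
doy = torch.randint(0, 366, (2, 10))   # acquisition dates (day of year)
out = FactorizedTemporoSpatialEncoder()(x, doy)  # -> (2, 10, 64, 128)
```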