Cloud cover in multispectral imagery (MSI) corrupts spectral information and significantly hinders early-season crop mapping. Existing Vision Transformer (ViT)-based time-series reconstruction methods, such as SMTS-ViT, often employ coarse temporal embeddings that aggregate entire sequences, which causes substantial information loss and reduces reconstruction accuracy. To address these limitations, this study proposes a Video Vision Transformer (ViViT)-based framework with temporal-spatial fusion embedding for MSI reconstruction in cloud-covered regions. Non-overlapping tubelets are extracted via 3D convolution with a constrained temporal span ($t=2$), which preserves local temporal coherence while reducing cross-day information degradation. The experiments cover both MSI-only reconstruction and fusion of synthetic aperture radar (SAR) with MSI. On 2020 Traill County data, MTS-ViViT achieves a 2.23\% reduction in MSE compared to the MTS-ViT baseline, while SMTS-ViViT, which integrates SAR, achieves a 10.33\% improvement over the SMTS-ViT baseline. The proposed framework thus enhances spectral reconstruction quality for robust agricultural monitoring.
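The tubelet extraction described above can be illustrated with a minimal sketch. It is not the authors' implementation: it relies on the fact that a 3D convolution whose kernel size equals its stride is equivalent to cutting the sequence into non-overlapping blocks, flattening each block, and applying one shared linear projection. The function names `extract_tubelets` and `project`, the spatial patch size `p`, the toy dimensions, and the averaging weights are all assumptions made for illustration; only the temporal span $t=2$ comes from the text.

```python
from itertools import product


def extract_tubelets(seq, t=2, p=2):
    """Split a [T][H][W][C] nested-list sequence into non-overlapping tubelets.

    Each tubelet spans t consecutive dates and a p x p pixel window and is
    flattened into one vector of length t * p * p * C.
    (p is a hypothetical patch size; the source only specifies t = 2.)
    """
    T, H, W = len(seq), len(seq[0]), len(seq[0][0])
    if T % t or H % p or W % p:
        raise ValueError("T, H and W must be divisible by t, p and p")
    tubelets = []
    # Step by the block size in every dimension, so blocks never overlap.
    for t0, y0, x0 in product(range(0, T, t), range(0, H, p), range(0, W, p)):
        vec = [
            value
            for dt in range(t)
            for dy in range(p)
            for dx in range(p)
            for value in seq[t0 + dt][y0 + dy][x0 + dx]
        ]
        tubelets.append(vec)
    return tubelets


def project(tubelets, weights):
    """Apply one shared linear map (weights: [D][len(vec)]) to every tubelet."""
    return [
        [sum(w * v for w, v in zip(row, vec)) for row in weights]
        for vec in tubelets
    ]


# Toy sequence: 4 dates, 4 x 4 pixels, 3 bands; every value is its date index.
T, H, W, C = 4, 4, 4, 3
seq = [[[[float(d)] * C for _ in range(W)] for _ in range(H)] for d in range(T)]

tokens = extract_tubelets(seq, t=2, p=2)
print(len(tokens), len(tokens[0]))  # 8 tubelets, each of length 24

# Hypothetical 1-dimensional embedding: the mean over each tubelet.
weights = [[1.0 / len(tokens[0])] * len(tokens[0])]
embeddings = project(tokens, weights)
```

With $t=2$, each token mixes information from only two adjacent acquisition dates (dates 0 and 1, or dates 2 and 3, in the toy sequence), in contrast to an embedding that aggregates the whole sequence at once. In a deep learning framework the same operation would typically be written as a single 3D convolution with kernel size and stride both set to $(t, p, p)$.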