Remote sensing imagery provides comprehensive views of the Earth, where different sensors collect complementary data at different spatial scales. Large, pretrained models are commonly finetuned with imagery that is heavily augmented to mimic different conditions and scales, with the resulting models used for various tasks with imagery from a range of spatial scales. Such models overlook scale-specific information in the data. In this paper, we present Scale-MAE, a pretraining method that explicitly learns relationships between data at different, known scales throughout the pretraining process. Scale-MAE pretrains a network by masking an input image at a known input scale, where the area of the Earth covered by the image determines the scale of the ViT positional encoding, not the image resolution. Scale-MAE encodes the masked image with a standard ViT backbone, and then decodes the masked image through a bandpass filter to reconstruct low/high frequency images at lower/higher scales. We find that tasking the network with reconstructing both low/high frequency images leads to robust multiscale representations for remote sensing imagery. Scale-MAE achieves an average of a $5.0\%$ non-parametric kNN classification improvement across eight remote sensing datasets compared to current state-of-the-art and obtains a $0.9$ mIoU to $3.8$ mIoU improvement on the SpaceNet building segmentation transfer task for a range of evaluation scales.
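For concreteness, the scale-aware positional encoding described above can be sketched as follows: a standard sinusoidal ViT positional encoding whose token positions are rescaled by the image's ground sample distance (GSD) relative to a reference GSD, so that tokens covering more ground receive proportionally larger effective positions. This is a minimal illustration under that assumption; the function and argument names (`gsd_positional_encoding`, `ref_gsd`) are ours, not the released Scale-MAE API.

```python
import torch

def gsd_positional_encoding(num_pos: int, dim: int, gsd: float, ref_gsd: float) -> torch.Tensor:
    """Sinusoidal positional encoding rescaled by ground sample distance.

    Tokens from coarser imagery (larger GSD, more ground per pixel) are
    mapped to proportionally larger positions, so the encoding reflects the
    area of the Earth covered rather than the pixel grid alone.
    Assumes an even embedding dimension `dim`.
    """
    pos = torch.arange(num_pos, dtype=torch.float32).unsqueeze(1)  # (num_pos, 1)
    # Rescale positions by relative ground coverage (the scale-aware step).
    pos = pos * (gsd / ref_gsd)
    i = torch.arange(0, dim, 2, dtype=torch.float32)               # (dim/2,)
    div = torch.pow(10000.0, i / dim)                              # frequency terms
    pe = torch.zeros(num_pos, dim)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe
```

With this construction, two images of the same pixel size but different GSDs receive different positional encodings, which is what lets the encoder distinguish scale rather than treating all inputs as equivalent grids.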
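The low/high frequency reconstruction targets can likewise be illustrated with a short sketch: a low-frequency target obtained by downsampling to a coarser scale, and a high-frequency target formed as a Laplacian-style bandpass residual at a finer scale (the fine image minus an upsampled blurred copy). This is a hedged approximation of the idea; the paper's exact bandpass construction and decoder layers may differ, and `frequency_targets` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

def frequency_targets(img: torch.Tensor, low_res: int, high_res: int):
    """Build low/high frequency reconstruction targets from (B, C, H, W) images.

    Returns a low-frequency image at a coarser scale and a high-frequency
    bandpass residual at a finer scale, in the spirit of a Laplacian pyramid.
    """
    # Low-frequency target: downsample, keeping only coarse spatial structure.
    low = F.interpolate(img, size=(low_res, low_res),
                        mode="bilinear", align_corners=False)
    # High-frequency target: subtract a blurred copy from the fine-scale
    # image, leaving only fine detail (edges, texture).
    fine = F.interpolate(img, size=(high_res, high_res),
                         mode="bilinear", align_corners=False)
    blurred = F.interpolate(low, size=(high_res, high_res),
                            mode="bilinear", align_corners=False)
    high = fine - blurred
    return low, high
```

Supervising the decoder with both targets forces the representation to carry information at multiple scales at once, which is the stated source of the robustness gains reported above.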