Remote sensing scene classification has been extensively studied for its critical roles in geological surveying, oil exploration, traffic management, earthquake prediction, wildfire monitoring, and intelligence gathering. In the past, machine learning (ML) methods for this task mainly relied on backbones pretrained with supervised learning (SL). As Masked Image Modeling (MIM), a self-supervised learning (SSL) technique, has been shown to learn better visual feature representations, it presents a new opportunity to improve ML performance on scene classification. This work explores the potential of MIM-pretrained backbones on four well-known classification datasets: Merced, AID, NWPU-RESISC45, and Optimal-31. Compared with published benchmarks, we show that MIM-pretrained Vision Transformer (ViT) backbones outperform the alternatives (by up to 18% in top-1 accuracy) and that MIM learns better feature representations than its supervised counterparts (by up to 5% in top-1 accuracy). Moreover, we show that general-purpose MIM-pretrained ViTs achieve performance competitive with the specially designed yet complicated Transformer for Remote Sensing (TRS) framework. Our experimental results also provide a performance baseline for future studies.