As the potential of foundation models for visual tasks has garnered significant attention, pretraining these models before deploying them on downstream tasks has become a crucial step. The three key factors in pretraining foundation models are the pretraining method, the size of the pretraining dataset, and the number of model parameters. Recent research in the remote sensing field has focused primarily on the pretraining method and the size of the dataset, with limited emphasis on the number of model parameters. This paper addresses this gap by examining the effect of increasing the number of model parameters on the performance of foundation models in downstream tasks such as rotated object detection and semantic segmentation. We pretrained foundation models with varying numbers of parameters (86M, 605.26M, 1.3B, and 2.4B) to determine whether performance on downstream tasks improves as the parameter count increases. To the best of our knowledge, ours is the first billion-scale foundation model in the remote sensing field. Furthermore, we propose an effective method for scaling up and fine-tuning a vision transformer in the remote sensing field. To evaluate general performance on downstream tasks, we employed the DOTA v2.0 and DIOR-R benchmark datasets for rotated object detection, and the Potsdam and LoveDA datasets for semantic segmentation. Experimental results demonstrate that, across all benchmark datasets and downstream tasks, both the performance and the data efficiency of the foundation models improved as the number of parameters increased. Moreover, our models achieve state-of-the-art performance on several datasets, including DIOR-R, Potsdam, and LoveDA.
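To make the reported model scales concrete, below is a minimal sketch of how a plain ViT encoder's parameter count grows with width and depth. Only the ViT-Base row is a standard published configuration (it lands at the abstract's ~86M figure); the larger width/depth combinations are illustrative assumptions chosen to fall near the paper's reported sizes, not the paper's actual architectures.

```python
# Minimal sketch: approximate trainable-parameter count of a plain ViT encoder.
# Per transformer block: MHSA (QKV + output projection), a 4x-expansion MLP,
# and two LayerNorms, giving roughly 12*d^2 + 13*d parameters for width d.

def vit_param_count(width: int, depth: int,
                    patch: int = 16, img: int = 224, channels: int = 3) -> int:
    """Approximate parameter count of a ViT encoder (no task head)."""
    per_block = 12 * width**2 + 13 * width          # attention + MLP + 2 LayerNorms
    patch_embed = patch * patch * channels * width + width
    tokens = (img // patch) ** 2 + 1                # image patches + [CLS] token
    pos_and_cls = tokens * width + width            # positional embeddings + [CLS]
    final_ln = 2 * width                            # final LayerNorm
    return depth * per_block + patch_embed + pos_and_cls + final_ln

# Only ViT-Base is a standard config; the rest are assumed width/depth
# settings that happen to land near the paper's reported parameter counts.
for name, w, d in [("ViT-Base", 768, 12),
                   ("~0.6B (assumed)", 1280, 32),
                   ("~1.3B (assumed)", 1536, 48),
                   ("~2.4B (assumed)", 2048, 48)]:
    print(f"{name}: {vit_param_count(w, d) / 1e6:.1f}M")
```

Running the sketch highlights why billion-scale ViTs grow width and depth together: the quadratic term in width dominates, so doubling the width roughly quadruples the per-block cost while adding depth scales it only linearly.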