Deep learning has largely reshaped remote sensing (RS) research on aerial image understanding and has achieved great success. Nevertheless, most existing deep models are initialized with ImageNet pretrained weights. Because natural images inevitably present a large domain gap relative to aerial images, this initialization probably limits finetuning performance on downstream aerial scene tasks. This issue motivates us to conduct an empirical study of remote sensing pretraining (RSP) on aerial images. To this end, we train different networks from scratch on MillionAID, the largest RS scene recognition dataset to date, to obtain a series of RS pretrained backbones, including both convolutional neural networks (CNNs) and vision transformers such as Swin and ViTAE, which have shown promising performance on computer vision tasks. We then investigate the impact of RSP on representative downstream tasks, including scene recognition, semantic segmentation, object detection, and change detection, using these CNN and vision transformer backbones. Our empirical study shows that RSP helps deliver distinctive performance in scene recognition tasks and in perceiving RS-related semantics such as "Bridge" and "Airplane". We also find that, although RSP mitigates the data discrepancies of traditional ImageNet pretraining on RS images, it may still suffer from task discrepancies, where downstream tasks require representations different from those learned for scene recognition. These findings call for further research efforts on both large-scale pretraining datasets and effective pretraining methods. The code and pretrained models will be released at https://github.com/ViTAE-Transformer/ViTAE-Transformer-Remote-Sensing.