Many current deep learning approaches make extensive use of backbone networks pre-trained on large datasets like ImageNet, which are then fine-tuned to perform a certain task. In remote sensing, the lack of comparably large annotated datasets and the wide diversity of sensing platforms impede similar developments. In order to contribute towards the availability of pre-trained backbone networks in remote sensing, we devise a self-supervised approach for pre-training deep neural networks. By exploiting the correspondence between geo-tagged audio recordings and remote sensing imagery, this is done in a completely label-free manner, eliminating the need for laborious manual annotation. For this purpose, we introduce the SoundingEarth dataset, which consists of co-located aerial imagery and audio samples all around the world. Using this dataset, we then pre-train ResNet models to map samples from both modalities into a common embedding space, encouraging the models to understand key properties of a scene that influence both its visual and its auditory appearance. To validate the usefulness of the proposed approach, we compare the transfer learning performance of the resulting pre-trained weights against weights obtained through other means. By fine-tuning the models on a number of commonly used remote sensing datasets, we show that our approach outperforms existing pre-training strategies for remote sensing imagery. The dataset, code and pre-trained model weights will be available at https://github.com/khdlr/SoundingEarth.
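The core of such cross-modal pre-training is an objective that pulls the embeddings of co-located image/audio pairs together while pushing apart embeddings of non-matching pairs from the same batch. As a minimal sketch of one common formulation of this idea (a symmetric InfoNCE-style contrastive loss; the exact loss, temperature, and encoder details used in the paper may differ), consider:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def symmetric_contrastive_loss(img_emb, aud_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of co-located (image, audio) pairs.

    img_emb, aud_emb: arrays of shape (B, D) where row i of each array
    comes from the same geographic location. Matching pairs are attracted,
    all other in-batch pairings are repelled.
    """
    img = l2_normalize(img_emb)
    aud = l2_normalize(aud_emb)
    logits = img @ aud.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(logits))             # i-th image matches i-th audio

    def cross_entropy(lg):
        # numerically stable log-softmax along each row
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # symmetrize: image-to-audio and audio-to-image retrieval directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In practice the rows of `img_emb` and `aud_emb` would come from the two ResNet encoders; here plain arrays stand in for them. Well-aligned pairs yield a loss near zero, while unrelated embeddings score near log(B).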