As the biodiversity and climate crises intensify, macroecological pursuits such as global biodiversity mapping become more urgent. Remote sensing offers a wealth of Earth observation data for ecological studies, but the scarcity of labeled datasets remains a major challenge. Recently, self-supervised learning has made it possible to learn representations from unlabeled data, triggering the development of pretrained geospatial models with generalizable features. However, these models are often trained on datasets biased toward areas of high human activity, leaving entire ecological regions underrepresented. Additionally, while some datasets attempt to address seasonality through multi-date imagery, they typically follow calendar seasons rather than local phenological cycles. To better capture vegetation seasonality at a global scale, we propose a simple phenology-informed sampling strategy and introduce the corresponding dataset, SSL4Eco, a multi-date Sentinel-2 dataset on which we train an existing model with a season-contrastive objective. We compare representations learned from SSL4Eco against those learned from other datasets on diverse ecological downstream tasks and demonstrate that our straightforward sampling method consistently improves representation quality, highlighting the importance of dataset construction. The model pretrained on SSL4Eco reaches state-of-the-art performance on 7 out of 8 downstream tasks spanning (multi-label) classification and regression. We release our code, data, and model weights to support macroecological and computer vision research at https://github.com/PlekhanovaElena/ssl4eco.
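To illustrate the contrast between calendar-season and phenology-informed sampling, the following is a minimal sketch and not the SSL4Eco implementation. The abstract does not specify how phenological dates are derived, so everything here is an assumption made for illustration: the function name `phenology_months`, the 12-value monthly greenness curve standing in for a vegetation index, and the midpoint-crossing rule used to locate green-up and senescence.

```python
from typing import Dict, List

# Fixed calendar-season sampling (Jan, Apr, Jul, Oct), shown only for comparison.
CALENDAR_MONTHS = [0, 3, 6, 9]


def phenology_months(greenness: List[float]) -> Dict[str, int]:
    """Pick four sampling months from one location's annual greenness cycle.

    `greenness` holds 12 monthly values of a hypothetical vegetation index
    (index 0 = January). The year is treated as circular so that cycles
    peaking around the new year (e.g. Southern Hemisphere) are handled.
    """
    n = len(greenness)
    peak = max(range(n), key=lambda m: greenness[m])
    trough = min(range(n), key=lambda m: greenness[m])
    midpoint = (greenness[peak] + greenness[trough]) / 2.0

    def first_crossing(start: int, stop: int, rising: bool) -> int:
        # Walk forward in time (wrapping around the year) from `start`
        # toward `stop`; return the first month that crosses the midpoint.
        m = start
        while m != stop:
            nxt = (m + 1) % n
            if rising and greenness[nxt] >= midpoint:
                return nxt
            if not rising and greenness[nxt] <= midpoint:
                return nxt
            m = nxt
        return stop

    return {
        "greenup": first_crossing(trough, peak, rising=True),
        "peak": peak,
        "senescence": first_crossing(peak, trough, rising=False),
        "dormancy": trough,
    }


# Hypothetical curves: a Northern Hemisphere site and the same cycle
# shifted by six months to mimic a Southern Hemisphere site.
north = [0.1, 0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
south = [north[(m + 6) % 12] for m in range(12)]

north_months = phenology_months(north)
south_months = phenology_months(south)
```

Under these assumptions, the two sites receive different sampling months that track the same phenological phases, whereas `CALENDAR_MONTHS` would image both sites on identical dates regardless of where each one is in its vegetation cycle.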