Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but their evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively licensed datasets. We introduce "capability" groups, which rank models on sets of datasets that share common characteristics (e.g., resolution, bands, temporality). These groups let users identify which models excel in each capability and which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This protocol not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming that the choices made during architecture design and pretraining specialize models toward particular tasks. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that the optimal model choice depends on task requirements, data modalities, and constraints, and that a single GeoFM that performs well across all tasks remains an open goal for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.