Three-dimensional geospatial analysis is critical for applications in urban planning, climate adaptation, and environmental assessment. However, current methodologies depend on costly, specialized sensors, such as LiDAR and multispectral sensors, which restrict global accessibility. Additionally, existing sensor-based and rule-driven methods struggle with tasks requiring the integration of multiple 3D cues, handling diverse queries, and providing interpretable reasoning. We present Geo3DVQA, a comprehensive benchmark that evaluates vision-language models (VLMs) in height-aware 3D geospatial reasoning from RGB imagery alone. Unlike conventional sensor-based frameworks, Geo3DVQA emphasizes realistic scenarios integrating elevation, sky view factors, and land cover patterns. The benchmark comprises 110k curated question-answer pairs across 16 task categories, including single-feature inference, multi-feature reasoning, and application-level analysis. Through a systematic evaluation of ten state-of-the-art VLMs, we reveal fundamental limitations in RGB-to-3D spatial reasoning. Our results further show that domain-specific instruction tuning consistently enhances model performance across all task categories, including height-aware and open-ended, application-oriented reasoning. Geo3DVQA provides a unified, interpretable framework for evaluating RGB-based 3D geospatial reasoning and identifies key challenges and opportunities for scalable 3D spatial analysis. The code and data are available at https://github.com/mm1129/Geo3DVQA.