Understanding environmental changes from remote sensing imagery is vital for climate resilience, urban planning, and ecosystem monitoring. Yet current vision-language models (VLMs) overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning. We present ChatENV, the first interactive VLM that jointly reasons over satellite image pairs and real-world sensor data. Our framework: (i) creates a 177k-image dataset forming 152k temporal pairs across 62 land-use classes in 197 countries with rich sensor metadata (e.g., temperature, PM10, CO); (ii) annotates the data with GPT-4o and Gemini 2.0 for stylistic and semantic diversity; and (iii) fine-tunes Qwen-2.5-VL with efficient Low-Rank Adaptation (LoRA) adapters for interactive chat. ChatENV achieves strong performance in temporal and "what-if" reasoning (e.g., a BERT F1 score of 0.902) and rivals or outperforms state-of-the-art temporal models, while supporting interactive scenario-based analysis. This positions ChatENV as a powerful tool for grounded, sensor-aware environmental monitoring.
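To make the fine-tuning step concrete, the following is a minimal sketch of attaching LoRA adapters to a Qwen-2.5-VL checkpoint with Hugging Face Transformers and PEFT. The base checkpoint name, adapter rank, scaling factor, and target modules are illustrative assumptions, not the hyperparameters used for ChatENV.

```python
# Minimal LoRA fine-tuning setup sketch for a Qwen2.5-VL model (assumed configuration,
# not ChatENV's reported hyperparameters).
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # assumed base checkpoint

# Load the processor (tokenizer + image preprocessing) and the base VLM.
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Low-Rank Adaptation: freeze the base weights and learn small rank-r updates
# on the attention projections, keeping the fine-tune lightweight.
lora_config = LoraConfig(
    r=16,                        # adapter rank (assumed value)
    lora_alpha=32,               # scaling factor (assumed value)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the full model
```

The wrapped model can then be trained on the sensor-grounded chat pairs with a standard supervised fine-tuning loop; only the adapter weights are updated, so the result can be stored and shipped as a small delta on top of the frozen base model.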