Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about how object geometry and relationships evolve in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across the dataset, benchmark, and model aspects, we introduce the DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs for DSR from in-the-wild videos. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and the further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) that seamlessly integrates geometric priors into VLMs by condensing question semantics and distilling question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability while maintaining accuracy on general video understanding benchmarks.
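To make the GSM idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation): a set of learnable queries is first conditioned on the question embedding and then cross-attends to dense features from a frozen 4D reconstruction backbone, yielding a small number of geometry tokens in the VLM embedding space. All module names, dimensions, and design choices here are illustrative assumptions.

```python
# Hypothetical sketch of a geometry-selection module; names and sizes are assumptions,
# not the paper's implementation.
import torch
import torch.nn as nn

class GeometrySelectionSketch(nn.Module):
    def __init__(self, geo_dim=1024, llm_dim=3584, num_tokens=16, num_heads=8):
        super().__init__()
        # Learnable queries that will carry the condensed question semantics.
        self.queries = nn.Parameter(torch.randn(num_tokens, llm_dim) * 0.02)
        self.question_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.geo_proj = nn.Linear(geo_dim, llm_dim)
        self.geo_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(llm_dim, llm_dim)

    def forward(self, question_emb, geo_feats):
        # question_emb: (B, Lq, llm_dim) token embeddings of the question
        # geo_feats:    (B, Lg, geo_dim) dense features from a frozen 4D reconstruction model
        B = question_emb.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Condense question semantics into the learnable queries.
        q, _ = self.question_attn(q, question_emb, question_emb)
        # Extract question-relevant geometry into a compact set of tokens.
        geo = self.geo_proj(geo_feats)
        geo_tokens, _ = self.geo_attn(q, geo, geo)
        return self.out_proj(geo_tokens)  # (B, num_tokens, llm_dim), appended to the VLM input

# Example: batch of 2, 32 question tokens, 4096 geometry patch features.
if __name__ == "__main__":
    gsm = GeometrySelectionSketch()
    out = gsm(torch.randn(2, 32, 3584), torch.randn(2, 4096, 1024))
    print(out.shape)  # torch.Size([2, 16, 3584])
```

In this sketch, keeping the number of geometry tokens small is what prevents the question-irrelevant portion of the 4D reconstruction features from flooding the VLM's context.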