Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often built through costly manual annotation, which severely limits scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline for constructing large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then feeds these trajectories, together with video frames, to Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines such as Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus offers a scalable way to curate fine-grained motion data for fine-tuning diverse models, enhancing their motion understanding and spatial reasoning capabilities.
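To make the pipeline concrete, the sketch below outlines the curation flow described above: detect and track objects to obtain trajectories, then prompt an LLM with the serialized trajectories (and, in practice, sampled frames) to produce motion captions and question-answer pairs. This is a minimal illustrative sketch, not the authors' implementation; `detect_and_track` and `prompt_llm` are hypothetical placeholders standing in for whatever detector/tracker and LLM one plugs in.

```python
# Minimal sketch of an automated motion-data curation pipeline in the spirit
# of FoundationMotion. Detector/tracker and LLM calls are placeholders
# (hypothetical names), not the paper's actual implementation.
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Track:
    object_id: int
    label: str                                   # e.g. "car", "person"
    boxes: List[Tuple[float, float, float, float]]  # per-frame (x, y, w, h)


def detect_and_track(video_frames) -> List[Track]:
    """Placeholder: run an off-the-shelf detector + tracker over the frames."""
    raise NotImplementedError


def prompt_llm(prompt: str) -> str:
    """Placeholder: query any instruction-tuned LLM or video-language model."""
    raise NotImplementedError


def curate_motion_annotations(video_frames) -> Dict[str, str]:
    # 1. Extract per-object trajectories from the video.
    tracks = detect_and_track(video_frames)

    # 2. Serialize trajectories into text the LLM can reason over.
    traj_text = "\n".join(
        f"{t.label}#{t.object_id}: {t.boxes}" for t in tracks
    )

    # 3. Generate a fine-grained motion caption grounded in the trajectories.
    caption = prompt_llm(
        "Describe how each object moves (direction, speed changes, "
        f"interactions), given these trajectories:\n{traj_text}"
    )

    # 4. Generate diverse motion / spatial-reasoning question-answer pairs.
    qa_pairs = prompt_llm(
        "Write question-answer pairs about relative motion, event order, "
        f"and spatial relations, grounded in:\n{traj_text}"
    )

    return {"caption": caption, "qa_pairs": qa_pairs}
```

Under these assumptions, the resulting caption/QA records can be collected across many videos into an instruction-tuning set for models such as NVILA-Video-15B or Qwen2.5-7B.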