Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image understanding and single-state spatiotemporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancing spatial intelligence. In this paper, we introduce $M^3$-Verse, a Multi-Modal, Multi-State, Multi-Dimensional benchmark designed to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, organized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3$-Verse thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. The construction pipeline is available at https://github.com/Wal-K-aWay/M3-Verse_pipeline, and the full benchmark data at https://www.modelscope.cn/datasets/WalKaWay/M3-Verse.