"Thinking with Text" and "Thinking with Images" paradigm significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, hindering unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that the video generation model is the potential unified multimodal understanding and generation model, positions "thinking with video" as a unified multimodal reasoning paradigm.