Rethinking Visual Intelligence: Insights from Video Pretraining - Zhuanzhi Papers

Pretraining · Video · Bias · Visual intelligence · Inductive bias

Rethinking Visual Intelligence: Insights from Video Pretraining

Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, Paolo Favaro

From arXiv. Updated version of the preprint arXiv:2506.07280 (Gen2Gen), focused on visual intelligence. This work can be considered v2.

Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
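The abstract mentions "lightweight adapters" without saying which kind. As an illustration only, the sketch below assumes a low-rank (LoRA-style) adapter, one common way to adapt a frozen pretrained layer; the class name `LowRankAdapter`, the toy weight matrix, and the rank are hypothetical and not taken from the paper.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LowRankAdapter:
    """Frozen weight W plus a trainable low-rank update B @ A.

    Assumption: this is a generic LoRA-style adapter, not necessarily
    the adapter used by the authors.
    """

    def __init__(self, W, rank, scale=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # pretrained weight, kept frozen
        # A starts small and random, B starts at zero, so the adapted
        # layer initially reproduces the pretrained layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = scale

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_trainable(self):
        """Count only the adapter parameters (A and B)."""
        return (len(self.A) * len(self.A[0])
                + len(self.B) * len(self.B[0]))


# Toy 4x4 "pretrained" weight (hypothetical values).
W = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 3.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
layer = LowRankAdapter(W, rank=1)
x = [1.0, 2.0, 3.0, 4.0]
y = layer.forward(x)
```

In this toy setting the adapter trains 8 parameters against 16 in the full matrix, and the gap widens quickly as the layer grows, which is what makes such adapters "lightweight".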
