In human cognition, many thought processes are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel vision-language framework that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle in which a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both the visual and textual modalities. Evaluations show that CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that the learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.
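To make the iterative reasoning cycle concrete, the following is a minimal PyTorch sketch of how an LQ-Former-style module could refine a chain of latent thought vectors over selected visual tokens. It is an illustration under stated assumptions, not the paper's implementation: all names, shapes, and hyperparameters (e.g., `LatentQFormer`, `select_salient_tokens`, `num_steps`, `top_k`) are hypothetical, and the contrastive and diffusion training heads are only indicated in comments.

```python
# Hypothetical sketch of the iterative latent reasoning cycle; names and shapes are assumptions.
import torch
import torch.nn as nn


class LatentQFormer(nn.Module):
    """Refines a set of latent thought vectors via cross-modal fusion (illustrative only)."""

    def __init__(self, dim: int = 768, num_heads: int = 8, num_latents: int = 16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)  # learned initial thoughts
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, thoughts, visual_tokens, text_tokens):
        # Fuse the current latent thoughts with visual and textual context.
        ctx = torch.cat([visual_tokens, text_tokens], dim=1)
        x = thoughts
        x = x + self.cross_attn(self.norm1(x), ctx, ctx, need_weights=False)[0]
        q = self.norm2(x)
        x = x + self.self_attn(q, q, q, need_weights=False)[0]
        return x + self.ffn(self.norm3(x))


def select_salient_tokens(visual_tokens, thoughts, top_k: int = 64):
    """Toy token selection: keep the visual tokens most relevant to the current thoughts."""
    scores = torch.einsum("bnd,bmd->bn", visual_tokens, thoughts) / thoughts.shape[-1] ** 0.5
    idx = scores.topk(min(top_k, visual_tokens.shape[1]), dim=1).indices
    return torch.gather(visual_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, visual_tokens.shape[-1]))


def reasoning_cycle(lq_former, visual_tokens, text_tokens, num_steps: int = 4):
    """Iteratively refine a chain of latent thought vectors before decoding."""
    batch = visual_tokens.shape[0]
    thoughts = lq_former.latents.unsqueeze(0).expand(batch, -1, -1)
    chain = []
    for _ in range(num_steps):
        salient = select_salient_tokens(visual_tokens, thoughts)  # attentional focus step
        thoughts = lq_former(thoughts, salient, text_tokens)      # cross-modal refinement step
        chain.append(thoughts)
    # In the full framework, the chain would be passed to the LLM for decoding and supervised
    # with contrastive and diffusion-based reconstruction objectives (not shown here).
    return chain
```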