Native Vision-Language Models (VLMs) have emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet two open questions hinder their widespread exploration and promotion: (1) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (2) How can research on native VLMs be made more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, a native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, alignment, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.