From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

Native Vision-Language Models (VLMs) have emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet two open questions still hinder their widespread exploration and adoption: (1) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (2) How can research on native VLMs be made more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, a native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.

