People see text. We read by recognizing words as visual objects, attending to their shapes, layouts, and patterns before connecting them to meaning, which lets us handle typos, distorted fonts, and diverse scripts. Modern large language models (LLMs), in contrast, rely on subword tokenization, fragmenting text into pieces drawn from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing the strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and 70.5% fewer FLOPs, with additional gains in cross-lingual generalization, robustness to typographic noise, and understanding of linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading, and takes a step toward more natural, cognitively inspired language models.
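To make the visual-text idea and the token-count comparison concrete, the following is a minimal, self-contained sketch: render a sentence onto a white canvas and count ViT-style image patches as a rough proxy for visual tokens. The use of PIL, the default font, the 16x16 patch size, and the character-based line wrapping are all illustrative assumptions, not SeeTok's actual rendering pipeline.

```python
# Illustrative sketch only: render plain text to an image ("visual-text") and
# estimate how many patch tokens a vision encoder would consume.
# PIL, the default font, and the 16x16 patch size are assumptions for
# illustration; they do not reflect SeeTok's implementation.
from PIL import Image, ImageDraw, ImageFont


def render_text(text: str, width: int = 512, font_size: int = 16) -> Image.Image:
    """Render plain text onto a white canvas."""
    font = ImageFont.load_default()
    # Rough line wrapping by character count; a real renderer measures glyph widths.
    chars_per_line = max(1, width // (font_size // 2))
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    height = font_size * (len(lines) + 1)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for row, line in enumerate(lines):
        draw.text((4, row * font_size), line, fill="black", font=font)
    return img


if __name__ == "__main__":
    sentence = "Tokenization-free reading: the model sees the text as an image."
    img = render_text(sentence)
    patch = 16  # assumed ViT-style patch size
    visual_tokens = (img.width // patch) * (img.height // patch)
    print(f"image size: {img.size}, approx. visual tokens: {visual_tokens}")
```

The rendered image would then be fed to a pretrained multimodal LLM in place of a subword token sequence; the patch count above only gauges sequence length, not model quality.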