Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs

Kai Zhuang, Jiawei Zhang, Yumou Liu, Hanqun Cao, Chunbin Gu, Mengdi Liu, Zhangyang Gao, Zitong Jerry Wang, Xuanhe Zhou, Pheng-Ann Heng, Lijun Wu, Conghui He, Cheng Tan

From arXiv; 38 pages; under review.

Scientific Large Language Models (Sci-LLMs) have emerged as a promising frontier for accelerating biological discovery. However, these models face a fundamental challenge when processing raw biomolecular sequences: the tokenization dilemma. Whether treating sequences as a specialized language, risking the loss of functional motif information, or as a separate modality, introducing formidable alignment challenges, current strategies fundamentally limit their reasoning capacity. We challenge this sequence-centric paradigm by positing that a more effective strategy is to provide Sci-LLMs with high-level structured context derived from established bioinformatics tools, thereby bypassing the need to interpret low-level noisy sequence data directly. Through a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we tested three input modes: sequence-only, context-only, and a combination of both. Our findings are striking: the context-only approach consistently and substantially outperforms all other modes. Even more revealing, the inclusion of the raw sequence alongside its high-level context consistently degrades performance, indicating that raw sequences act as informational noise, even for models with specialized tokenization schemes. These results suggest that the primary strength of existing Sci-LLMs lies not in their nascent ability to interpret biomolecular syntax from scratch, but in their profound capacity for reasoning over structured, human-readable knowledge. Therefore, we argue for reframing Sci-LLMs not as sequence decoders, but as powerful reasoning engines over expert knowledge. This work lays the foundation for a new class of hybrid scientific AI agents, repositioning the developmental focus from direct sequence interpretation towards high-level knowledge synthesis. The code is available at https://github.com/opendatalab-raiser/CoKE.
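The three input modes compared in the abstract (sequence-only, context-only, and both combined) can be pictured as three ways of assembling one prompt. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Example` class, the `build_prompt` function, the mode names, the prompt layout, and the placeholder sequence and annotations are all hypothetical and chosen only for illustration. The released code at the repository above may differ in every detail.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical mode names mirroring the three settings described in the abstract.
MODES = ("sequence_only", "context_only", "sequence_and_context")


@dataclass
class Example:
    """One question about a biomolecule (illustrative structure, not from the paper)."""
    question: str
    sequence: str = ""                                      # raw sequence
    context: Dict[str, str] = field(default_factory=dict)   # tool-derived annotations


def build_prompt(example: Example, mode: str) -> str:
    """Assemble a text prompt for one of the three input modes."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    parts = []
    if mode in ("sequence_only", "sequence_and_context"):
        parts.append(f"Sequence: {example.sequence}")
    if mode in ("context_only", "sequence_and_context"):
        # Render structured annotations as human-readable bullet lines.
        lines = [f"- {key}: {value}" for key, value in example.context.items()]
        parts.append("Context:\n" + "\n".join(lines))
    parts.append(f"Question: {example.question}")
    return "\n\n".join(parts)


# Placeholder data: the sequence and annotations below are made up for illustration.
demo = Example(
    question="What is the likely function of this protein?",
    sequence="MKTAYIAKQR",
    context={"Domain": "placeholder domain annotation",
             "Homolog": "placeholder homolog description"},
)

if __name__ == "__main__":
    for mode in MODES:
        print(f"=== {mode} ===")
        print(build_prompt(demo, mode))
        print()
```

In this sketch the context-only prompt never contains the raw sequence, which is the setting the abstract reports as performing best; each prompt would then be sent to the model under evaluation.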
