Predicting protein function from sequence is a central challenge in computational biology. Existing methods rely heavily on structured ontologies or similarity-based techniques and therefore often lack the flexibility to express structure-free functional descriptions or novel biological functions. In this work, we introduce Prot2Text-V2, a novel multimodal sequence-to-text model that generates free-form natural language descriptions of protein function directly from amino acid sequences. Our method combines a protein language model as sequence encoder (ESM-3B) with a decoder-only language model (LLaMA-3.1-8B-Instruct) through a lightweight nonlinear modality projector. A key innovation is our Hybrid Sequence-level Contrastive Alignment Learning (H-SCALE), which improves cross-modal learning by matching mean- and std-pooled protein embeddings against text representations via a contrastive loss. After the alignment phase, we apply instruction-based LoRA fine-tuning to the decoder, teaching the model to generate accurate protein function descriptions conditioned on the protein sequence. We train Prot2Text-V2 on approximately 250K curated SwissProt entries and evaluate it under low-homology conditions, where test sequences share low similarity with training samples. Prot2Text-V2 consistently outperforms both traditional and LLM-based baselines across a range of metrics.