Understanding the binding specificity between T-cell receptors (TCRs) and peptide-major histocompatibility complexes (pMHCs) is central to immunotherapy and vaccine development. However, current predictive models struggle with generalization, especially in data-scarce settings and when faced with novel epitopes. We present LANTERN (Large lAnguage model-powered TCR-Enhanced Recognition Network), a deep learning framework that combines large-scale protein language models with chemical representations of peptides. By encoding TCR β-chain sequences using ESM-1b and transforming peptide sequences into SMILES strings processed by MolFormer, LANTERN captures rich biological and chemical features critical for TCR-peptide recognition. Through extensive benchmarking against existing models such as ChemBERTa, TITAN, and NetTCR, LANTERN demonstrates superior performance, particularly in zero-shot and few-shot learning scenarios. Our model also benefits from a robust negative sampling strategy and shows significant clustering improvements via embedding analysis. These results highlight the potential of LANTERN to advance TCR-pMHC binding prediction and support the development of personalized immunotherapies.
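As a rough illustration of the peptide-to-SMILES step mentioned in the abstract, the sketch below builds a SMILES string for a linear peptide by concatenating per-residue templates. It is not the authors' pipeline: the `RESIDUE_SMILES` table and the `peptide_to_smiles` helper are hypothetical names introduced here, stereochemistry and charge states are omitted, and only the 20 standard amino acids are handled. In practice a cheminformatics toolkit such as RDKit would normally perform this conversion and preserve chirality.

```python
# Simplified residue templates (assumption of this sketch): backbone nitrogen,
# alpha carbon carrying the side chain, then the carbonyl group.
# Stereochemistry and protonation states are deliberately left out.
RESIDUE_SMILES = {
    "G": "NCC(=O)",
    "A": "NC(C)C(=O)",
    "V": "NC(C(C)C)C(=O)",
    "L": "NC(CC(C)C)C(=O)",
    "I": "NC(C(C)CC)C(=O)",
    "S": "NC(CO)C(=O)",
    "T": "NC(C(C)O)C(=O)",
    "C": "NC(CS)C(=O)",
    "M": "NC(CCSC)C(=O)",
    "D": "NC(CC(=O)O)C(=O)",
    "E": "NC(CCC(=O)O)C(=O)",
    "N": "NC(CC(N)=O)C(=O)",
    "Q": "NC(CCC(N)=O)C(=O)",
    "K": "NC(CCCCN)C(=O)",
    "R": "NC(CCCNC(=N)N)C(=O)",
    "F": "NC(Cc1ccccc1)C(=O)",
    "Y": "NC(Cc1ccc(O)cc1)C(=O)",
    "W": "NC(Cc1c[nH]c2ccccc12)C(=O)",
    "H": "NC(Cc1c[nH]cn1)C(=O)",
    "P": "N1CCCC1C(=O)",  # side chain cyclizes back onto the backbone nitrogen
}


def peptide_to_smiles(sequence: str) -> str:
    """Return a non-stereo SMILES string for a linear peptide with free termini."""
    if not sequence:
        raise ValueError("empty peptide sequence")
    fragments = []
    for aa in sequence.upper():
        if aa not in RESIDUE_SMILES:
            raise ValueError(f"unsupported residue: {aa!r}")
        fragments.append(RESIDUE_SMILES[aa])
    # Adjacent fragments join carbonyl carbon to the next backbone nitrogen
    # (the peptide bond); the trailing "O" completes the C-terminal acid.
    return "".join(fragments) + "O"


if __name__ == "__main__":
    # Example 9-mer peptide, chosen only for illustration.
    print(peptide_to_smiles("GILGFVFTL"))
```

The resulting string is the kind of input a chemical language model could tokenize, while the TCR β-chain would stay in amino-acid form for a protein language model; how LANTERN combines the two embeddings is not specified in the abstract and is not sketched here.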