Protein language models (LMs) have been successful in sequence, structure, and function prediction. However, current protein LMs are limited to encoder-only or decoder-only architectures that model single sequences, whereas many biological contexts involve protein-protein interactions. Here, we introduce pAbT5, which models antibody chain pairing as forward- and back-translation using a T5-based architecture. We show that pAbT5 accurately reflects chain pairing through sequence generation. Our protein LM generates variable-length sequences, and its next-word prediction probabilities agree with the position-specific scoring matrix derived from sequence alignment. Like other protein LMs, pAbT5 achieves state-of-the-art unsupervised prediction of experimental measurements. To the best of our knowledge, pAbT5 is the first generative encoder-decoder protein LM for protein-protein interactions.
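The framing of chain pairing as forward- and back-translation can be illustrated with a minimal sketch of how paired chains might be turned into sequence-to-sequence examples. This is not the authors' data pipeline. The function name `make_translation_pairs`, the choice of heavy-to-light as the "forward" direction, and the toy sequence fragments are all assumptions made for illustration.

```python
from typing import List, Tuple


def make_translation_pairs(
    paired_chains: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Turn (heavy, light) chain pairs into (source, target) examples.

    Each pair yields two examples, one per translation direction, so an
    encoder-decoder model would see both heavy -> light and light -> heavy.
    """
    examples: List[Tuple[str, str]] = []
    for heavy, light in paired_chains:
        # Assumed "forward" direction: heavy chain in, light chain out.
        examples.append((heavy, light))
        # Assumed "back" direction: light chain in, heavy chain out.
        examples.append((light, heavy))
    return examples


if __name__ == "__main__":
    # Toy fragments for illustration only, not full antibody chains.
    toy_pairs = [("EVQLVESGG", "DIQMTQSP")]
    for source, target in make_translation_pairs(toy_pairs):
        print(f"{source} -> {target}")
```

Because source and target are separate sequences, the target length is not tied to the source length, which is the property that lets an encoder-decoder model generate variable-length partner chains.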