Multimodal Pre-Training Model for Sequence-based Prediction of Protein-Protein Interaction

Protein-protein interactions (PPIs), in which two or more proteins physically bind together to carry out their functions, are essential to many biological processes. Modeling PPIs is useful for many biomedical applications, such as vaccine design, antibody therapeutics, and peptide drug discovery. Pre-training a protein model to learn effective representations is critical for PPI prediction. Most pre-training models for PPIs are sequence-based and naively apply language models from natural language processing to amino acid sequences. More advanced works use structure-aware pre-training, taking advantage of the contact maps of known protein structures. However, neither sequences nor contact maps can fully characterize the structures and functions of proteins, which are closely related to the PPI problem. Motivated by this observation, we propose a multimodal protein pre-training model with three modalities: sequence, structure, and function (S2F). Notably, instead of using contact maps to learn rigid, amino acid-level structures, we encode the structural features with the topology complex of point clouds of heavy atoms. This allows our model to learn structural information about not only the backbones but also the side chains. Moreover, our model incorporates knowledge from functional descriptions of proteins extracted from the literature or manual annotations. Our experiments show that S2F learns protein embeddings that achieve good performance on a variety of PPI tasks, including cross-species PPI prediction, antibody-antigen affinity prediction, antibody neutralization prediction for SARS-CoV-2, and mutation-driven binding affinity change prediction.
