Geometric deep learning has recently achieved great success in non-Euclidean domains, and learning on the 3D structures of large biomolecules is emerging as a distinct research area. However, its efficacy is largely constrained by the limited quantity of structural data. Meanwhile, protein language models trained on large corpora of 1D sequences have shown burgeoning capabilities with scale across a broad range of applications. Nevertheless, no prior study has considered combining these different protein modalities to promote the representation power of geometric neural networks. To address this gap, we take the first step toward integrating the knowledge learned by well-trained protein language models into several state-of-the-art geometric networks. We evaluate on a variety of protein representation learning benchmarks, including protein-protein interface prediction, model quality assessment, protein-protein rigid-body docking, and binding affinity prediction, achieving an overall improvement of 20% over baselines and new state-of-the-art performance. Strong evidence indicates that incorporating the knowledge of protein language models enhances the capacity of geometric networks by a significant margin and generalizes to complex tasks.