Protein representation learning methods have shown great potential to yield useful representations for many downstream tasks, especially protein classification. Moreover, a few recent studies have shown great promise in addressing the scarcity of protein labels with self-supervised learning methods. However, existing protein language models are usually pretrained on protein sequences alone, without considering important protein structural information. To this end, we propose a novel structure-aware protein self-supervised learning method to effectively capture the structural information of proteins. In particular, a graph neural network (GNN) model is pretrained to preserve protein structural information through self-supervised tasks defined from two perspectives: pairwise residue distances and dihedral angles. Furthermore, we propose to leverage an available protein language model pretrained on protein sequences to enhance the self-supervised learning. Specifically, we identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme. Experiments on several supervised downstream tasks verify the effectiveness of our proposed method.
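As a rough illustration of the two geometric quantities named above, the following sketch computes a pairwise distance matrix and a dihedral angle from 3D coordinates. It is not the method proposed here: the abstract does not say how the self-supervised targets are constructed, so the function names `pairwise_distances` and `dihedral`, the use of one coordinate per residue, and the toy coordinates are all assumptions made for the example.

```python
import math


def pairwise_distances(coords):
    """Return the symmetric matrix of Euclidean distances between 3D points.

    `coords` is a list of (x, y, z) tuples, e.g. one point per residue
    (an assumption for this sketch).
    """
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Return the dihedral angle in radians defined by four consecutive points.

    Uses the standard atan2 formulation on the three bond vectors.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.atan2(y, x)


# Toy coordinates (hypothetical, not taken from any real protein).
toy_coords = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
distance_matrix = pairwise_distances(toy_coords)
toy_angle = dihedral(*toy_coords)
```

In a pretraining setup of the kind described, quantities like these could serve as regression or classification targets for a structure encoder; how they are actually used in the proposed method is not specified in this text.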