Protein representation learning has primarily benefited from the remarkable development of language models (LMs). Accordingly, pre-trained protein models also inherit a problem from LMs: a lack of factual knowledge. A recent solution models the relationships between proteins and associated knowledge terms as the knowledge encoding objective. However, it fails to explore these relationships at a more granular, i.e., token, level. To mitigate this, we propose Knowledge-exploited Auto-encoder for Protein (KeAP), which performs token-level knowledge graph exploration for protein representation learning. In practice, non-masked amino acids iteratively query the associated knowledge tokens to extract and integrate helpful information for restoring masked amino acids via attention. We show that KeAP consistently outperforms its previous counterpart on 9 representative downstream applications, sometimes surpassing it by large margins. These results suggest that KeAP provides an alternative yet effective way to perform knowledge-enhanced protein representation learning.
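To make the token-level mechanism concrete, below is a minimal sketch (not the authors' code) of the cross-attention step described above: amino-acid representations act as queries over knowledge-token representations, and the extracted information is fused back before predicting the masked residues. All module and variable names (e.g., `TokenLevelKnowledgeDecoder`) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TokenLevelKnowledgeDecoder(nn.Module):
    """Illustrative sketch: amino-acid tokens query knowledge tokens via attention."""

    def __init__(self, dim: int = 768, num_heads: int = 8, vocab_size: int = 30):
        super().__init__()
        # Amino-acid tokens (queries) attend to knowledge tokens (keys/values).
        # In practice such a block would typically be stacked to query iteratively.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlm_head = nn.Linear(dim, vocab_size)  # predicts masked amino acids

    def forward(self, protein_repr: torch.Tensor, knowledge_repr: torch.Tensor):
        # protein_repr:   (batch, protein_len, dim) from a protein encoder
        # knowledge_repr: (batch, text_len, dim) from a text/knowledge encoder
        extracted, _ = self.cross_attn(
            query=protein_repr, key=knowledge_repr, value=knowledge_repr
        )
        fused = self.norm(protein_repr + extracted)  # integrate extracted knowledge
        return self.mlm_head(fused)  # logits used to restore masked residues

# Usage with random features standing in for encoder outputs.
decoder = TokenLevelKnowledgeDecoder()
protein = torch.randn(2, 128, 768)    # protein sequence features
knowledge = torch.randn(2, 64, 768)   # associated knowledge-term features
logits = decoder(protein, knowledge)  # shape: (2, 128, 30)
```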