Protein language models have excelled in a variety of tasks, ranging from structure prediction to protein engineering. However, proteins are highly diverse in functions and structures, and current state-of-the-art models, including the latest version of AlphaFold, rely on Multiple Sequence Alignments (MSA) to feed in evolutionary knowledge. Despite their success, heavy computational overheads, as well as de novo and orphan proteins, remain great challenges in protein representation learning. In this work, we show that MSA-augmented models inherently belong to retrieval-augmented methods. Motivated by this finding, we introduce Retrieved Sequence Augmentation (RSA) for protein representation learning without additional alignment or pre-processing. RSA links query protein sequences to a set of sequences with similar structures or properties in the database and combines these sequences for downstream prediction. We show that protein language models benefit from the retrieval enhancement on both structure prediction and property prediction tasks, with a 5% improvement over MSA Transformer on average while being 373 times faster. In addition, we show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction. Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences. Code is available at https://github.com/HKUNLP/RSA.
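The retrieval-then-combine idea described above can be sketched in a few lines. The following is a minimal, hypothetical illustration only: the amino-acid-composition embedding, the toy database, and the `augment` helper are stand-ins invented for this sketch, not the paper's actual dense retriever or model.

```python
# Hypothetical sketch of retrieval-augmented protein prediction in the
# spirit of RSA: embed sequences, retrieve the nearest neighbours from a
# database, and pass the query plus retrieved sequences downstream.
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def embed(seq):
    """Toy embedding: normalised amino-acid composition vector."""
    counts = Counter(seq)
    total = len(seq) or 1
    return [counts.get(a, 0) / total for a in AMINO_ACIDS]

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query, database, k=2):
    """Return the k database sequences most similar to the query."""
    q = embed(query)
    ranked = sorted(database, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:k]

def augment(query, database, k=2):
    """Query plus retrieved sequences, ready for a downstream predictor."""
    return [query] + retrieve(query, database, k)

db = ["MKTAYIAKQR", "GGGGSGGGGS", "MKTAYLAKQR", "PPPPPPPPPP"]
batch = augment("MKTAYIAKQC", db, k=2)
```

Unlike MSA construction, nothing here requires aligning the retrieved sequences to the query, which is the source of the speedup the abstract reports; a real system would replace the toy composition embedding with a learned dense retriever.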