Protein language models have excelled in a variety of tasks, ranging from structure prediction to protein engineering. However, proteins are highly diverse in functions and structures, and current state-of-the-art models, including the latest version of AlphaFold, rely on Multiple Sequence Alignments (MSA) to feed in evolutionary knowledge. Despite their success, heavy computational overheads, as well as de novo and orphan proteins, remain great challenges in protein representation learning. In this work, we show that MSA-augmented models inherently belong to retrieval-augmented methods. Motivated by this finding, we introduce Retrieved Sequence Augmentation (RSA) for protein representation learning without additional alignment or pre-processing. RSA links query protein sequences to a set of sequences with similar structures or properties in the database and combines these sequences for downstream prediction. We show that protein language models benefit from the retrieval enhancement on both structure prediction and property prediction tasks, with a 5% improvement over MSA Transformer on average while being 373 times faster. In addition, we show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction. Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences. Code is available at https://github.com/HKUNLP/RSA.
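The retrieval-then-combine idea described above can be sketched in a few lines. The following is a minimal, hypothetical illustration only: the amino-acid-composition embedding, the toy database, and the `augment` helper are stand-ins invented for this sketch, not the paper's actual dense retriever or model.

```python
# Hypothetical sketch of retrieval-augmented protein prediction in the
# spirit of RSA: embed sequences, retrieve the nearest neighbours from a
# database, and pass the query plus retrieved sequences downstream.
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def embed(seq):
    """Toy embedding: normalised amino-acid composition vector."""
    counts = Counter(seq)
    total = len(seq) or 1
    return [counts.get(a, 0) / total for a in AMINO_ACIDS]

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query, database, k=2):
    """Return the k database sequences most similar to the query."""
    q = embed(query)
    ranked = sorted(database, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:k]

def augment(query, database, k=2):
    """Query plus retrieved sequences, ready for a downstream predictor."""
    return [query] + retrieve(query, database, k)

db = ["MKTAYIAKQR", "GGGGSGGGGS", "MKTAYLAKQR", "PPPPPPPPPP"]
batch = augment("MKTAYIAKQC", db, k=2)
```

Unlike MSA construction, nothing here requires aligning the retrieved sequences to the query, which is the source of the speedup the abstract reports; a real system would replace the toy composition embedding with a learned dense retriever.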