Predicting protein structures from sequences is an important task for function prediction, drug design, and understanding related biological processes. Recent advances have demonstrated the power of language models (LMs) in processing protein sequence databases: these models inherit the advantages of attention networks and capture useful information when learning representations of proteins. The past two years have witnessed remarkable success in tertiary protein structure prediction (PSP), including both evolution-based and single-sequence-based PSP. Pipelines based on protein language models (pLMs), rather than energy-based models and sampling procedures, appear to have emerged as the mainstream paradigm in PSP. Despite this progress, the PSP community still lacks a systematic and up-to-date survey that bridges the gap between LMs in the natural language processing (NLP) and PSP domains and introduces their methodologies, advancements, and practical applications. To this end, we first introduce the similarities between protein and human languages that allow LMs to be extended to pLMs and applied to protein databases. We then systematically review recent advances in LMs and pLMs from the perspectives of network architectures, pre-training strategies, applications, and commonly used protein databases. Next, we discuss different types of PSP methods, focusing on how pLM-based architectures function in the process of protein folding. Finally, we identify challenges faced by the PSP community and outline promising research directions enabled by advances in pLMs. This survey aims to be a hands-on guide for researchers who want to understand PSP methods, develop pLMs, and tackle challenging practical problems in this field.