Large-scale Protein Language Models (PLMs) have improved performance in protein prediction tasks, ranging from 3D structure prediction to various function predictions. In particular, AlphaFold, a ground-breaking AI system, could potentially reshape structural biology. However, the utility of the PLM module in AlphaFold, Evoformer, has not been explored beyond structure prediction. In this paper, we investigate the representation ability of three popular PLMs: ESM-1b (single sequence), MSA-Transformer (multiple sequence alignment), and Evoformer (structural), with a special focus on Evoformer. Specifically, we aim to answer the following key questions: (\romannumeral1) Does the Evoformer trained as part of AlphaFold produce representations amenable to predicting protein function? (\romannumeral2) If yes, can Evoformer replace ESM-1b and MSA-Transformer? (\romannumeral3) How much do these PLMs rely on evolution-related protein data? In this regard, are they complementary to each other? We compare these models through an empirical study and offer new insights and conclusions. Finally, we release our code and datasets for reproducibility.