AI-based protein structure prediction pipelines, such as AlphaFold2, have achieved near-experimental accuracy. These pipelines rely mainly on Multiple Sequence Alignments (MSAs) as inputs, from which they learn co-evolution information across homologous sequences. However, searching protein databases to build MSAs is time-consuming, usually taking dozens of minutes. We therefore explore the limits of fast protein structure prediction using only the primary sequences of proteins. We propose HelixFold-Single, which combines a large-scale protein language model (PLM) with the superior geometric learning capability of AlphaFold2. HelixFold-Single first pre-trains the PLM on thousands of millions of primary sequences using the self-supervised learning paradigm; the PLM then serves as an alternative to MSAs for learning co-evolution information. By combining the pre-trained PLM with the essential components of AlphaFold2, we then obtain an end-to-end differentiable model that predicts the 3D coordinates of atoms from the primary sequence alone. HelixFold-Single is validated on the CASP14 and CAMEO datasets, achieving accuracy competitive with MSA-based methods on targets with large homologous families. Furthermore, HelixFold-Single takes much less time than mainstream protein structure prediction pipelines, demonstrating its potential in tasks requiring many predictions. The code of HelixFold-Single is available at https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold-single, and we also provide stable web services at https://paddlehelix.baidu.com/app/drug/protein-single/forecast.
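The data flow described above (primary sequence, then PLM features, then a geometric module, then 3D coordinates) can be illustrated with a toy NumPy sketch. This is not the HelixFold-Single implementation: every function name here (`tokenize`, `toy_plm`, `toy_structure_head`, `predict_structure`) is hypothetical, the weights are random rather than pre-trained, and the use of attention maps as pairwise features is an assumption made for illustration. It shows only the shape of the pipeline, with no MSA search step.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def tokenize(sequence):
    """Map a primary sequence to integer residue indices."""
    return np.array([AMINO_ACIDS.index(aa) for aa in sequence])


def toy_plm(tokens, dim=16, heads=4, seed=0):
    """Hypothetical stand-in for a pre-trained PLM (random weights).

    Returns per-residue features of shape (L, dim) and pairwise
    features of shape (L, L, heads) built from softmax attention maps.
    """
    rng = np.random.default_rng(seed)
    embed = rng.normal(size=(len(AMINO_ACIDS), dim))
    single = embed[tokens]
    maps = []
    for _ in range(heads):
        wq = rng.normal(size=(dim, dim))
        wk = rng.normal(size=(dim, dim))
        logits = (single @ wq) @ (single @ wk).T / np.sqrt(dim)
        logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
        attn = np.exp(logits)
        attn /= attn.sum(axis=-1, keepdims=True)
        maps.append(attn)
    pair = np.stack(maps, axis=-1)
    return single, pair


def toy_structure_head(single, pair, seed=1):
    """Hypothetical stand-in for the geometric modules: features -> (L, 3)."""
    rng = np.random.default_rng(seed)
    context = pair.mean(axis=-1) @ single  # pair-weighted residue context
    w = rng.normal(size=(single.shape[1], 3))
    return (single + context) @ w


def predict_structure(sequence):
    """Single-sequence pipeline: no MSA search anywhere in the path."""
    tokens = tokenize(sequence)
    single, pair = toy_plm(tokens)
    return toy_structure_head(single, pair)


coords = predict_structure("MKTAYIAKQR")
print(coords.shape)  # one (x, y, z) row per residue
```

Because every stage is a differentiable array operation, a real version of this pipeline can be trained end to end, which is the property the abstract highlights.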