Data-driven predictive methods that can efficiently and accurately transform protein sequences into biologically active structures are highly valuable for scientific research and therapeutic development. Determining an accurate folding landscape from co-evolutionary information is fundamental to the success of modern protein structure prediction methods. As the state of the art, AlphaFold2 has dramatically raised prediction accuracy without performing explicit co-evolutionary analysis. Nevertheless, its performance still depends strongly on the available sequence homologs. We investigate the cause of this dependence and present EvoGen, a meta generative model, to remedy the underperformance of AlphaFold2 on targets with a poor multiple sequence alignment (MSA). EvoGen allows us to manipulate the folding landscape either by denoising the searched MSA or by generating a virtual MSA. It helps AlphaFold2 fold accurately in the low-data regime and even achieve encouraging performance with single-sequence predictions. The ability to make accurate predictions from a few-shot MSA not only lets AlphaFold2 generalize better to orphan sequences, but also democratizes its use in high-throughput applications. In addition, EvoGen combined with AlphaFold2 yields a probabilistic structure generation method that can explore alternative conformations of protein sequences, and the task-aware differentiable algorithm for sequence generation should benefit related tasks, including protein design.