How can we design protein sequences that fold into desired structures both effectively and efficiently? AI methods for structure-based protein design have attracted increasing attention in recent years; however, few methods improve accuracy and efficiency simultaneously, because they lack expressive features and rely on autoregressive sequence decoders. To address these issues, we propose PiFold, which combines a novel residue featurizer with PiGNN layers to generate protein sequences in a one-shot manner with improved recovery. Experiments show that PiFold achieves 51.66\% recovery on CATH 4.2 while running inference 70 times faster than autoregressive competitors. In addition, PiFold achieves 58.72\% and 60.42\% recovery on TS50 and TS500, respectively. We conduct comprehensive ablation studies to reveal the roles of different types of protein features and model designs, which suggests directions for further simplification and improvement. The PyTorch code is available at \href{https://github.com/A4Bio/PiFold}{GitHub}.
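The abstract reports results as recovery percentages without defining the metric. As a minimal sketch, assuming recovery means the fraction of positions at which the designed sequence has the same residue as the native sequence, it could be computed as follows. The function name `sequence_recovery` and the example sequences are hypothetical and are not taken from the PiFold code base.

```python
def sequence_recovery(predicted: str, native: str) -> float:
    """Fraction of positions where the predicted residue equals the native one.

    Assumption: both sequences are one-letter amino-acid strings of equal
    length, aligned position by position to the same backbone.
    """
    if len(predicted) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == n for p, n in zip(predicted, native))
    return matches / len(native)


if __name__ == "__main__":
    # Hypothetical toy sequences: 3 of 4 positions agree.
    print(f"{sequence_recovery('ACDE', 'ACDF'):.2%}")
```

Under this assumption, a benchmark-level score would be an aggregate of the per-protein values; the abstract does not say which aggregate is used.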