Diffusion- and flow-based generative models have recently demonstrated strong performance in protein backbone generation, offering unprecedented capabilities for de novo protein design. However, despite their generation quality, these models are limited by slow sampling, often requiring hundreds of iterative steps in the reverse-diffusion process. This computational bottleneck limits their practical utility in large-scale protein discovery, where thousands to millions of candidate structures are needed. To address this challenge, we explore score distillation, which has shown great success in reducing the number of sampling steps in the vision domain while maintaining high generation quality. However, a straightforward adaptation of these methods results in unacceptably low designability. Through extensive study, we identify how to appropriately adapt Score identity Distillation (SiD), a state-of-the-art score distillation strategy, to train few-step protein backbone generators that significantly reduce sampling time while maintaining performance comparable to that of their pretrained teacher model. In particular, multistep generation combined with inference-time noise modulation is key to this success. We demonstrate that our distilled few-step generators achieve more than a 20-fold improvement in sampling speed while achieving levels of designability, diversity, and novelty similar to those of the Proteina teacher model. This reduction in inference cost enables large-scale in silico protein design, thereby bringing diffusion-based models closer to real-world protein engineering applications. The PyTorch implementation is available at https://github.com/LY-Xie/SiD_Protein
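As a rough illustration of what multistep generation with inference-time noise modulation could look like, the following is a minimal sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the names `toy_generator` and `few_step_sample`, the `noise_scale` parameter, the linear interpolation between the predicted sample and fresh noise, and the convention that t = 1 is pure noise and t = 0 is a clean sample. The stand-in generator is a toy function, whereas the real one would be a distilled neural network; see the linked repository for the actual method.

```python
import random


def toy_generator(x_t, t):
    """Hypothetical stand-in for a distilled few-step generator.

    Maps a noisy set of 3D coordinates at noise level t to a guess of the
    clean coordinates. Here it simply shrinks the input by (1 - t).
    """
    return [[(1.0 - t) * c for c in residue] for residue in x_t]


def few_step_sample(generator, num_residues, num_steps=4, noise_scale=0.5, seed=0):
    """Sketch of multistep sampling with a scaled re-noising step.

    Assumptions (not taken from the paper): t runs from 1 (noise) to 0
    (clean), and the intermediate state is (1 - t) * x0_hat + t * eps,
    where eps is Gaussian noise multiplied by noise_scale.
    """
    rng = random.Random(seed)

    def gaussian_coords(scale):
        # One [x, y, z] triple per residue, drawn from N(0, scale^2).
        return [[scale * rng.gauss(0.0, 1.0) for _ in range(3)]
                for _ in range(num_residues)]

    # Evenly spaced noise levels: 1.0, ..., 0.0 (num_steps generator calls).
    times = [1.0 - i / num_steps for i in range(num_steps + 1)]

    x_t = gaussian_coords(1.0)  # start from unscaled Gaussian noise
    x0_hat = x_t
    for t_cur, t_next in zip(times[:-1], times[1:]):
        # One generator call per step: predict the clean sample.
        x0_hat = generator(x_t, t_cur)
        if t_next > 0.0:
            # Re-noise to the next level; noise_scale modulates the
            # amount of fresh noise injected at inference time.
            eps = gaussian_coords(noise_scale)
            x_t = [[(1.0 - t_next) * a + t_next * e for a, e in zip(ra, re)]
                   for ra, re in zip(x0_hat, eps)]
    return x0_hat


if __name__ == "__main__":
    backbone = few_step_sample(toy_generator, num_residues=8, num_steps=4)
    print(len(backbone), "residues,", len(backbone[0]), "coordinates each")
```

In this sketch the sampler calls the generator only `num_steps` times, compared with the hundreds of steps a standard reverse-diffusion sampler would take, and `noise_scale` is the single knob that controls how much fresh noise is injected between calls.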