This paper demonstrates that language models are strong structure-based protein designers. We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs), which have learned massive sequential evolutionary knowledge from the universe of natural protein sequences, to acquire an immediate capability to design preferable protein sequences for given folds. We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows them with structural awareness. During inference, iterative refinement is performed to effectively optimize the generated protein sequences. Experiments show that LM-Design improves the state-of-the-art results by a large margin, leading to 4% to 12% accuracy gains in sequence recovery (e.g., 55.65%/56.63% on CATH 4.2/4.3 single-chain benchmarks, and >60% when designing protein complexes). We provide extensive and in-depth analyses, which verify that LM-Design can (1) indeed leverage both structural and sequential knowledge to accurately handle structurally non-deterministic regions, (2) benefit from scaling data and model size, and (3) generalize to other proteins (e.g., antibodies and de novo proteins).
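The iterative refinement mentioned above follows a mask-predict style of decoding: predict every residue, then re-mask and re-predict the least confident positions over several rounds. The sketch below illustrates only that decoding loop; `toy_logits` is a random stand-in for the actual structure-conditioned pLM forward pass, and all names here are illustrative rather than taken from the paper.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard residue types
MASK = -1  # placeholder index for masked positions

rng = np.random.default_rng(0)

def toy_logits(seq, length):
    # Stand-in for a structure-conditioned pLM forward pass:
    # returns per-position logits over the amino-acid vocabulary.
    return rng.normal(size=(length, len(AMINO_ACIDS)))

def iterative_refine(length, n_iters=5):
    """Mask-predict decoding: fill in all positions, then re-mask
    the lowest-confidence ones and re-predict them each round."""
    seq = np.full(length, MASK)
    for t in range(n_iters):
        logits = toy_logits(seq, length)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        pred = probs.argmax(-1)   # most likely residue per position
        conf = probs.max(-1)      # confidence per position
        seq = pred.copy()
        # Linearly decay how many positions are re-masked per round,
        # so the final round leaves a fully specified sequence.
        n_mask = int(length * (n_iters - 1 - t) / n_iters)
        if n_mask > 0:
            lowest = np.argsort(conf)[:n_mask]
            seq[lowest] = MASK  # these get re-predicted next round
    return "".join(AMINO_ACIDS[i] for i in seq)

designed = iterative_refine(length=16)
print(designed)
```

With a real model, `toy_logits` would condition on both the partially masked sequence and the encoded backbone structure, so each round can revise low-confidence residues using the context fixed in earlier rounds.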