This paper demonstrates that language models are strong structure-based protein designers. We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs), which have learned massive sequential evolutionary knowledge from the universe of natural protein sequences, to acquire an immediate capability to design preferable protein sequences for given folds. We perform structural surgery on pLMs, implanting a lightweight structural adapter that endows them with structural awareness. During inference, iterative refinement is performed to effectively optimize the generated protein sequences. Experiments show that our approach outperforms state-of-the-art methods by a large margin, yielding 4% to 12% accuracy gains in sequence recovery (e.g., 55.65% and 56.63% on the CATH 4.2 and 4.3 single-chain benchmarks, and >60% when designing protein complexes). We provide extensive and in-depth analyses, which verify that LM-Design can (1) indeed leverage both structural and sequential knowledge to accurately handle structurally non-deterministic regions, (2) benefit from scaling data and model size, and (3) generalize to other proteins (e.g., antibodies and de novo proteins).
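The iterative refinement performed during inference can be illustrated with a minimal sketch: each round, the structure-conditioned model re-predicts every residue, high-confidence predictions are kept, and the least confident positions are re-masked for the next round. Everything here is a hypothetical stand-in (`toy_model` replaces the adapted pLM; the schedule and re-masking fraction are illustrative, not the paper's actual hyperparameters).

```python
import numpy as np

AA = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids
rng = np.random.default_rng(0)

def toy_model(seq, structure):
    # Stand-in predictor returning per-position distributions over amino
    # acids. A real model would condition on `structure` via the
    # structural adapter implanted into the pLM.
    logits = rng.normal(size=(len(seq), len(AA))) + structure
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    return probs

def iterative_refine(length, structure, rounds=5, remask_frac=0.3):
    seq = ["X"] * length                      # start fully masked
    for _ in range(rounds):
        probs = toy_model(seq, structure)
        conf = probs.max(-1)                  # per-position confidence
        seq = [AA[i] for i in probs.argmax(-1)]   # accept all predictions...
        k = int(remask_frac * length)
        for i in np.argsort(conf)[:k]:        # ...then re-mask the k least
            seq[i] = "X"                      # confident positions
    # final pass fills any remaining masked positions
    probs = toy_model(seq, structure)
    return "".join(AA[i] for i in probs.argmax(-1))

designed = iterative_refine(16, structure=rng.normal(size=(16, 20)))
```

The key design choice mirrored here is that refinement is parallel and confidence-driven rather than left-to-right autoregressive: every position can be revised in every round, which lets the model correct early mistakes in structurally ambiguous regions.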