Protein language models (pLMs), pre-trained via causal language modeling on protein sequences, have emerged as promising tools for protein sequence design. In real-world protein engineering, amino acids in the middle of a protein sequence are often optimized while the surrounding residues are kept fixed. Unfortunately, because of their left-to-right nature, existing pLMs can only generate suffix residues conditioned on a prefix prompt, which is insufficient for infilling tasks that must account for the full surrounding context. To identify pLMs better suited to protein engineering, we design a new benchmark, Secondary structureE InFilling rEcoveRy (SEIFER), which approximates infilling sequence-design scenarios. Evaluating existing models on this benchmark reveals the weaknesses of current language models and shows that language models trained via the fill-in-middle transformation, called ProtFIM, are more appropriate for protein engineering. We also demonstrate, through extensive experiments and visualizations, that ProtFIM generates protein sequences while learning decent protein representations.
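The fill-in-middle transformation mentioned above can be illustrated with a minimal sketch: a random middle span is cut out of a sequence and moved to the end, so that a standard left-to-right language model learns to generate the middle conditioned on both the prefix and the suffix. The sentinel tokens `<PRE>`, `<SUF>`, and `<MID>` below are illustrative placeholders, not the paper's actual vocabulary.

```python
import random

def fim_transform(seq, rng=None, pre="<PRE>", suf="<SUF>", mid="<MID>"):
    """Rearrange a sequence for fill-in-middle training.

    A middle span [i, j) is sampled, then the example is emitted as
    prefix + suffix + middle, so a causal LM predicting left to right
    sees the full surrounding context before generating the middle.
    """
    rng = rng or random.Random()
    # Sample two distinct cut points and order them.
    i, j = sorted(rng.sample(range(len(seq) + 1), 2))
    prefix, middle, suffix = seq[:i], seq[i:j], seq[j:]
    return f"{pre}{prefix}{suf}{suffix}{mid}{middle}"
```

At inference time, the same sentinel layout lets the model infill: the engineer supplies `<PRE>prefix<SUF>suffix<MID>` as the prompt, and the model's continuation is the redesigned middle segment.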