Plug & Play Directed Evolution of Proteins with Gradient-based Discrete MCMC

From arXiv: 33 pages, 8 figures. Under review. Code is available at https://github.com/pemami4911/ppde. A short version of this work appeared at the NeurIPS 2022 Machine Learning in Structural Biology Workshop.

A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations that improve the function of a known protein. We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models, such as protein language models, and supervised models that predict protein function from sequence. By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins. Our framework achieves this without any model fine-tuning or re-training by constructing a product of experts distribution directly in discrete protein space. Instead of resorting to brute force search or random sampling, which is typical of classic directed evolution, we introduce a fast MCMC sampler that uses gradients to propose promising mutations. We conduct in silico directed evolution experiments on wide fitness landscapes and across a range of different pre-trained unsupervised models, including a 650M parameter protein language model. Our results demonstrate an ability to efficiently discover variants with high evolutionary likelihood as well as estimated activity multiple mutations away from a wild type protein, suggesting our sampler provides a practical and effective new paradigm for machine-learning-based protein engineering.
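
The two ingredients named in the abstract, a product-of-experts target defined directly over discrete sequences and an MCMC proposal guided by gradients, can be illustrated with a toy sketch. This is not the paper's sampler and does not use the ppde code. It assumes two made-up experts (a site-wise linear score standing in for an unsupervised model, and a quadratic penalty standing in for a supervised predictor) and a generic single-site, gradient-informed proposal with a Metropolis-Hastings correction. All function names, the `target` parameter, and the sizes are hypothetical.

```python
import numpy as np

def make_experts(L, A, rng):
    # Assumption: toy experts, not real protein models.
    W = rng.normal(size=(L, A))  # stand-in "unsupervised" site-wise scores
    V = rng.normal(size=(L, A))  # stand-in "supervised" predictor weights
    return W, V

def log_p(x, W, V, target=3.0):
    # Product of experts: unnormalized log-probabilities add.
    return float(np.sum(W * x) - (np.sum(V * x) - target) ** 2)

def grad_log_p(x, W, V, target=3.0):
    # Analytic gradient of log_p with respect to the one-hot matrix x.
    return W - 2.0 * (np.sum(V * x) - target) * V

def proposal_logits(x, W, V):
    g = grad_log_p(x, W, V)
    # First-order estimate of the change in log_p for every single
    # substitution (site i, residue a), relative to the current residue.
    delta = g - np.sum(g * x, axis=1, keepdims=True)
    return (delta / 2.0).ravel()

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def mh_step(x, W, V, rng):
    L, A = x.shape
    log_q_fwd = log_softmax(proposal_logits(x, W, V))
    idx = rng.choice(L * A, p=np.exp(log_q_fwd))
    i, a = divmod(int(idx), A)
    a_old = int(np.argmax(x[i]))
    x_new = x.copy()
    x_new[i] = 0.0
    x_new[i, a] = 1.0
    # Probability of proposing the reverse move from the new state.
    log_q_rev = log_softmax(proposal_logits(x_new, W, V))[i * A + a_old]
    log_alpha = (log_p(x_new, W, V) - log_p(x, W, V)
                 + log_q_rev - log_q_fwd[idx])
    if np.log(rng.uniform()) < log_alpha:
        return x_new, True
    return x, False

def run_chain(L=8, A=20, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    W, V = make_experts(L, A, rng)
    x = np.eye(A)[rng.integers(A, size=L)]  # random one-hot starting sequence
    start = best = log_p(x, W, V)
    for _ in range(steps):
        x, _ = mh_step(x, W, V, rng)
        best = max(best, log_p(x, W, V))
    return start, best, x

if __name__ == "__main__":
    start, best, _ = run_chain()
    print(f"start log p = {start:.3f}, best log p seen = {best:.3f}")
```

Because the experts only enter through `log_p` and `grad_log_p`, swapping in a different differentiable scorer would not change the sampling loop, which is the sense in which such a scheme needs no fine-tuning or re-training. With real neural models the gradient would come from automatic differentiation rather than a closed form.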
