Plug & Play Directed Evolution of Proteins with Gradient-based Discrete MCMC

From arXiv: 31 pages, 8 figures. To appear in the Machine Learning: Science & Technology (ML:S&T) journal. Code is available at https://github.com/pemami4911/ppde. A short version of this work appeared at the NeurIPS 2022 Machine Learning in Structural Biology Workshop.

A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations that improve the function of a known protein. We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models, such as protein language models, and supervised models that predict protein function from sequence. By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins. Our framework achieves this without any model fine-tuning or re-training by constructing a product of experts distribution directly in discrete protein space. Instead of resorting to brute force search or random sampling, which is typical of classic directed evolution, we introduce a fast MCMC sampler that uses gradients to propose promising mutations. We conduct in silico directed evolution experiments on wide fitness landscapes and across a range of different pre-trained unsupervised models, including a 650M parameter protein language model. Our results demonstrate an ability to efficiently discover variants with high evolutionary likelihood as well as estimated activity multiple mutations away from a wild type protein, suggesting our sampler provides a practical and effective new paradigm for machine-learning-based protein engineering.

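Summary: A long-standing goal of machine-learning-based protein engineering is to speed up the discovery of new mutations that improve the function of a known protein. The paper introduces a sampling framework for evolving proteins in silico that can mix and match unsupervised models (for example, protein language models) with supervised models that predict function from sequence. Composing these models is meant to improve the evaluation of unseen mutations (that is, mutations absent from the training set) and to restrict the search to regions of sequence space likely to contain functional proteins. The framework builds a product-of-experts distribution directly in discrete protein space, so no model needs fine-tuning or re-training. In place of the brute-force search or random sampling typical of classic directed evolution, a fast MCMC sampler uses gradients to propose promising mutations. In silico directed evolution experiments cover wide fitness landscapes and several pre-trained unsupervised models, including a protein language model with 650 million parameters. The reported results show that the sampler efficiently finds variants that lie multiple mutations away from the wild type and have both high evolutionary likelihood and high estimated activity, which the authors present as evidence that it offers a practical and effective new paradigm for machine-learning-based protein engineering.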
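
To make the two ideas in the abstract concrete (a product of experts over discrete sequences, and gradient-guided mutation proposals), here is a minimal toy sketch. It is not the paper's implementation. Everything in it is an assumption made for illustration: the two "experts" are random linear scorers standing in for an unsupervised and a supervised model, the names `W_unsup`, `W_sup`, `LAMBDA`, `proposal_logits`, and `mh_step` are hypothetical, and the proposal is a generic locally informed one (softmax over first-order estimates of the score change, corrected by a Metropolis-Hastings test), which may differ from the sampler the authors actually use.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
L, A = 8, len(AMINO_ACIDS)  # toy sequence length and alphabet size
rng = np.random.default_rng(0)

# Hypothetical stand-in experts: each scores a one-hot sequence x (L x A) linearly.
# Real experts would be, e.g., a protein language model and a fitness predictor.
W_unsup = rng.normal(size=(L, A))
W_sup = rng.normal(size=(L, A))
LAMBDA = 1.0  # assumed weight on the supervised expert


def log_prob(x):
    """Unnormalized product-of-experts log-probability: a sum of expert log-scores."""
    return float(np.sum(W_unsup * x) + LAMBDA * np.sum(W_sup * x))


def grad_log_prob(x):
    """Gradient of log_prob w.r.t. the one-hot input (exact here, since it is linear)."""
    return W_unsup + LAMBDA * W_sup


def proposal_logits(x):
    """First-order estimate of the score change for every single substitution."""
    g = grad_log_prob(x)
    current = np.sum(g * x, axis=1, keepdims=True)  # gradient at the current residue
    logits = (g - current) / 2.0
    logits[x.astype(bool)] = -np.inf  # forbid proposing the residue already present
    return logits


def log_softmax_flat(logits):
    flat = logits.ravel()
    m = np.max(flat)
    return flat - (m + np.log(np.sum(np.exp(flat - m))))


def mh_step(x, rng):
    """Propose one substitution from the gradient-informed proposal, then accept or reject."""
    fwd = log_softmax_flat(proposal_logits(x))
    p = np.exp(fwd)
    idx = int(rng.choice(fwd.size, p=p / p.sum()))
    i, a = divmod(idx, A)
    old = int(np.argmax(x[i]))
    x_new = x.copy()
    x_new[i] = 0.0
    x_new[i, a] = 1.0
    # Reverse move: from x_new, put the old residue back at position i.
    rev = log_softmax_flat(proposal_logits(x_new))
    log_accept = log_prob(x_new) - log_prob(x) + rev[i * A + old] - fwd[idx]
    if np.log(rng.random()) < log_accept:
        return x_new
    return x


def run_chain(x0, n_steps, rng):
    x, best, best_lp = x0, x0, log_prob(x0)
    for _ in range(n_steps):
        x = mh_step(x, rng)
        lp = log_prob(x)
        if lp > best_lp:
            best, best_lp = x, lp
    return x, best, best_lp


# Random "wild type" as the starting point of the chain.
x0 = np.eye(A)[rng.integers(A, size=L)]
x, best, best_lp = run_chain(x0, 200, rng)
print("start:", "".join(AMINO_ACIDS[j] for j in x0.argmax(axis=1)), log_prob(x0))
print("best: ", "".join(AMINO_ACIDS[j] for j in best.argmax(axis=1)), best_lp)
```

Because the toy experts are linear, the first-order estimate of each substitution's effect is exact. With neural experts, the gradient would come from automatic differentiation and would only approximate the change, which is why the accept/reject step is kept.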