Directed evolution is a versatile technique in protein engineering that mimics natural selection by iteratively alternating between mutagenesis and screening to search for sequences that optimize a property of interest, such as catalytic activity or binding affinity to a specified target. However, the space of possible proteins is too large to search exhaustively in the laboratory, and functional proteins are scarce within it. Machine learning (ML) approaches can accelerate directed evolution by learning to map protein sequences to functions without building a detailed model of the underlying physics, chemistry, and biological pathways. Despite their great potential, these ML methods encounter severe challenges in identifying the most suitable sequences for a targeted function. These failures can be attributed to the common practice of adopting high-dimensional feature representations for protein sequences and to inefficient search methods. To address these issues, we propose an efficient, experimental design-oriented closed-loop optimization framework for protein directed evolution, termed ODBO, which combines a novel low-dimensional protein encoding strategy with Bayesian optimization enhanced by search space prescreening via outlier detection. We further design an initial sample selection strategy to minimize the number of experimental samples needed for training ML models. We conduct and report four protein directed evolution experiments that substantiate the capability of the proposed framework to find variants with properties of interest. We expect the ODBO framework to greatly reduce the experimental and time costs of directed evolution, and we expect that it can be further generalized as a powerful tool for adaptive experimental design in a broader context.
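To make the closed loop described above concrete, the following is a minimal sketch of one way such a loop could be arranged: encode, prescreen, fit a surrogate, pick a batch, measure, repeat. Everything specific in it is an assumption for illustration and not the paper's implementation: the four-letter alphabet, the hidden `toy_fitness` landscape standing in for a wet-lab assay, the mean-fitness encoding in `encode`, the z-score rule in `prescreen` (a simple stand-in for a real outlier detector), and the RBF Gaussian process with an upper-confidence-bound acquisition in `gp_ucb`.

```python
import itertools
import math
import random

ALPHABET = "ACDE"  # toy alphabet (assumption)
LENGTH = 4         # toy sequence length (assumption)

def toy_fitness(seq):
    """Hypothetical hidden landscape standing in for a wet-lab assay."""
    site = {"A": 0.1, "C": 0.4, "D": 0.7, "E": 1.0}
    score = sum(site[aa] * (i + 1) for i, aa in enumerate(seq))
    return score + (1.5 if seq[0] == "E" and seq[3] == "D" else 0.0)

def encode(seq, measured):
    """One feature per site: mean measured fitness of variants sharing that residue."""
    overall = sum(measured.values()) / len(measured)
    feats = []
    for i, aa in enumerate(seq):
        vals = [y for s, y in measured.items() if s[i] == aa]
        feats.append(sum(vals) / len(vals) if vals else overall)
    return feats

def prescreen(cands, measured, z_cut=-0.5):
    """Drop candidates whose mean feature is a low outlier (simple z-score rule)."""
    scores = [sum(encode(s, measured)) / LENGTH for s in cands]
    mu = sum(scores) / len(scores)
    sd = math.sqrt(sum((v - mu) ** 2 for v in scores) / len(scores)) or 1.0
    return [s for s, v in zip(cands, scores) if (v - mu) / sd > z_cut]

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel with unit amplitude."""
    return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / (2 * length_scale ** 2))

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L x = b."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x

def solve_upper_t(L, b):
    """Back substitution: solve L^T x = b."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_ucb(X, y, Xq, noise=1e-2, beta=2.0):
    """GP posterior mean + beta * posterior std at each query point (UCB)."""
    mean_y = sum(y) / len(y)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, [v - mean_y for v in y]))
    out = []
    for q in Xq:
        k = [rbf(q, x) for x in X]
        v = solve_lower(L, k)
        mu = mean_y + sum(ki * ai for ki, ai in zip(k, alpha))
        var = max(1.0 - sum(vi * vi for vi in v), 0.0)
        out.append(mu + beta * math.sqrt(var))
    return out

def run_loop(rounds=5, batch=4, n_init=8, seed=0):
    """Closed loop: encode, prescreen, score with GP-UCB, 'measure' the top batch."""
    rng = random.Random(seed)
    space = ["".join(p) for p in itertools.product(ALPHABET, repeat=LENGTH)]
    measured = {s: toy_fitness(s) for s in rng.sample(space, n_init)}
    for _ in range(rounds):
        cands = prescreen([s for s in space if s not in measured], measured)
        X = [encode(s, measured) for s in measured]
        scores = gp_ucb(X, list(measured.values()), [encode(s, measured) for s in cands])
        for _, s in sorted(zip(scores, cands), reverse=True)[:batch]:
            measured[s] = toy_fitness(s)  # simulated experiment on the chosen variant
    return measured
```

Calling `run_loop()` returns a dictionary from measured sequences to their toy fitness values; `max(run_loop().values())` gives the best variant found within the budget. The encoding is recomputed each round from the current measurements, so the feature space stays at one dimension per site regardless of alphabet size, which is the kind of low-dimensional representation the abstract argues for.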