We present a new method for design problems in which the goal is to maximize or specify the value of one or more properties of interest. For example, in protein design, one may wish to find the protein sequence that maximizes fluorescence. We assume access to one or more stochastic, potentially black-box "oracle" predictive functions, each of which maps from the input design space (e.g., protein sequences) to a distribution over a property of interest (e.g., protein fluorescence). At first glance, this problem can be framed as one of optimizing the oracle(s) with respect to the input. However, many state-of-the-art predictive models, such as neural networks, are known to suffer from pathologies, especially for inputs far from the training distribution. Thus, we need to modulate the optimization of the oracle inputs with prior knowledge about what makes "realistic" inputs (e.g., proteins that fold stably). Herein, we propose a new method to solve this problem, Conditioning by Adaptive Sampling, which yields state-of-the-art results on a protein fluorescence problem compared to other recently published approaches. Formally, our method achieves its success by using model-based adaptive sampling to estimate the conditional distribution of the input sequences given the desired properties.
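To make the idea of model-based adaptive sampling concrete, the following is a minimal toy sketch, not the algorithm of this paper. It assumes a hypothetical noisy oracle (`noisy_oracle`, which counts occurrences of symbol 0), a uniform prior over short sequences, and a simple per-position categorical search model in place of a learned generative model. Each iteration samples from the search model, keeps samples whose oracle value exceeds a quantile threshold, reweights them by prior over search-model probability, and refits the search model by weighted maximum likelihood, so that the model moves toward the prior conditioned on a high property value. All function names and parameter values here are illustrative assumptions.

```python
import numpy as np

def noisy_oracle(seqs, rng, noise=0.5):
    """Hypothetical stochastic oracle: number of 0-symbols plus Gaussian noise."""
    return (seqs == 0).sum(axis=1) + rng.normal(0.0, noise, size=len(seqs))

def sample(probs, n, rng):
    """Draw n sequences from an independent per-position categorical model."""
    L, K = probs.shape
    u = rng.random((n, L, 1))
    cdf = np.cumsum(probs, axis=1)[None, :, :]
    # Inverse-CDF sampling; clip guards against rounding in the last CDF entry.
    return (u > cdf).sum(axis=2).clip(max=K - 1)

def log_prob(probs, seqs):
    """Log-probability of each sequence under the per-position model."""
    L = probs.shape[0]
    return np.log(probs[np.arange(L), seqs]).sum(axis=1)

def adaptive_sampling(prior, n_iter=20, n_samples=500, quantile=0.8, seed=0):
    """Toy adaptive sampling loop (illustrative assumption, not the paper's method)."""
    rng = np.random.default_rng(seed)
    L, K = prior.shape
    q = prior.copy()  # search model starts at the prior
    for _ in range(n_iter):
        x = sample(q, n_samples, rng)
        y = noisy_oracle(x, rng)
        gamma = np.quantile(y, quantile)  # threshold set from the current samples
        # Importance weights: prior / search model, restricted to samples above gamma.
        w = np.exp(log_prob(prior, x) - log_prob(q, x)) * (y >= gamma)
        w = w / w.sum()
        # Weighted maximum-likelihood refit with a small pseudo-count for stability.
        onehot = np.eye(K)[x]  # shape (n_samples, L, K)
        counts = (w[:, None, None] * onehot).sum(axis=0) + 1e-3
        q = counts / counts.sum(axis=1, keepdims=True)
    return q

if __name__ == "__main__":
    prior = np.full((8, 4), 0.25)  # uniform prior: length-8 sequences, 4 symbols
    q = adaptive_sampling(prior)
    print(np.round(q[:, 0], 2))  # per-position probability of the favored symbol
```

The importance weights are what distinguish this sketch from plain oracle optimization: they keep the refit anchored to the prior rather than to whatever the search model has drifted toward. With a non-uniform prior, the same loop would favor sequences that the prior considers realistic.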