The development of powerful natural language models has improved our ability to learn meaningful representations of protein sequences. In addition, advances in high-throughput mutagenesis, directed evolution, and next-generation sequencing have allowed large amounts of labeled fitness data to accumulate. Leveraging these two trends, we introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder that features a highly structured latent space trained to jointly generate sequences and predict fitness. Through its regularized prediction heads, ReLSO provides a powerful protein sequence encoder and a novel approach for efficient fitness landscape traversal. Using ReLSO, we explicitly model the sequence-function landscape of large labeled datasets and generate new molecules by optimizing within the latent space using gradient-based methods. We evaluate this approach on several publicly available protein datasets, including variant sets of anti-ranibizumab and GFP. ReLSO achieves greater sequence optimization efficiency (increase in fitness per optimization step) than competing approaches and more robustly generates high-fitness sequences. Furthermore, the attention-based relationships learned by the jointly trained ReLSO models provide a potential avenue toward sequence-level fitness attribution.
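The gradient-based latent space optimization described above can be sketched as follows. This is a minimal illustration, not ReLSO's implementation: a toy concave quadratic stands in for the learned, regularized fitness-prediction head, and all names (`f`, `grad_f`, `latent_ascent`, `z_star`) are hypothetical.

```python
import numpy as np

def f(z, z_star):
    """Toy surrogate fitness predictor: concave quadratic peaking at z_star.
    In ReLSO this role is played by the jointly trained prediction head."""
    return -np.sum((z - z_star) ** 2)

def grad_f(z, z_star):
    """Analytic gradient of the toy surrogate with respect to the latent point."""
    return -2.0 * (z - z_star)

def latent_ascent(z0, z_star, lr=0.1, steps=50):
    """Gradient ascent on predicted fitness within the latent space."""
    z = z0.copy()
    for _ in range(steps):
        z = z + lr * grad_f(z, z_star)
    return z

rng = np.random.default_rng(0)
z_star = rng.normal(size=8)   # latent fitness optimum (unknown in practice)
z0 = rng.normal(size=8)       # latent encoding of a starting sequence
z_opt = latent_ascent(z0, z_star)
# In the full pipeline, z_opt would be decoded back into a protein sequence.
```

Because the surrogate is differentiable, each step moves the latent point toward higher predicted fitness; the structured latent space is what makes such steps correspond to meaningful sequence changes after decoding.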