Generating new molecules with specified chemical and biological properties via generative models has emerged as a promising direction for drug discovery. However, existing methods require extensive training or fine-tuning on a large dataset, which is often unavailable in real-world generation tasks. In this work, we propose a new retrieval-based framework for controllable molecule generation. We use a small set of exemplar molecules, i.e., those that (partially) satisfy the design criteria, to steer a pre-trained generative model towards synthesizing molecules that satisfy the given design criteria. We design a retrieval mechanism that retrieves the exemplar molecules and fuses them with the input molecule. The mechanism is trained with a new self-supervised objective that predicts the nearest neighbor of the input molecule. We also propose an iterative refinement process that dynamically updates the generated molecules and the retrieval database for better generalization. Our approach is agnostic to the choice of generative model and requires no task-specific fine-tuning. On tasks ranging from simple design criteria to a challenging real-world scenario, designing lead compounds that bind to the SARS-CoV-2 main protease, we demonstrate that our approach extrapolates well beyond the retrieval database and achieves better performance and wider applicability than previous methods.
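The retrieve, fuse, generate, and update loop described in the abstract can be illustrated with a toy sketch. This is not the paper's architecture. Everything here is an assumption made for illustration: molecules are stand-in real-valued vectors, `score` is a hypothetical property oracle, retrieval uses Euclidean nearest neighbors, `fuse` is a fixed distance-weighted average rather than a learned module, and `generate` adds Gaussian noise in place of a pre-trained generative model.

```python
import math
import random


def score(x):
    # Hypothetical property oracle: higher is better, maximal at the all-ones vector.
    return -sum((xi - 1.0) ** 2 for xi in x)


def retrieve(query, database, k):
    # Return the k exemplars closest to the query (Euclidean distance, an assumption).
    return sorted(database, key=lambda e: math.dist(query, e))[:k]


def fuse(query, exemplars):
    # Distance-weighted average of the query and its exemplars.
    # A fixed stand-in for a learned fusion module.
    items = [query] + exemplars
    weights = [math.exp(-math.dist(query, e)) for e in items]
    total = sum(weights)
    return [
        sum(w * e[i] for w, e in zip(weights, items)) / total
        for i in range(len(query))
    ]


def generate(fused, n, noise, rng):
    # Stand-in for decoding with a pre-trained generative model:
    # sample n candidates around the fused representation.
    return [[v + rng.gauss(0.0, noise) for v in fused] for _ in range(n)]


def iterative_refinement(x0, database, steps=20, k=3, n=16, noise=0.1, seed=0):
    rng = random.Random(seed)
    database = [list(e) for e in database]  # copy so the caller's list is untouched
    best = list(x0)
    for _ in range(steps):
        exemplars = retrieve(best, database, k)
        fused = fuse(best, exemplars)
        candidates = generate(fused, n, noise, rng)
        # Grow the retrieval database with candidates that beat the
        # weakest exemplar retrieved in this round.
        cutoff = min(score(e) for e in exemplars)
        database.extend(c for c in candidates if score(c) > cutoff)
        # Keep the best candidate as the next input only if it improves the score.
        top = max(candidates, key=score)
        if score(top) > score(best):
            best = top
    return best, database


if __name__ == "__main__":
    exemplar_db = [[0.5, 0.5], [0.8, 0.2], [0.2, 0.9]]
    start = [0.0, 0.0]
    best, grown_db = iterative_refinement(start, exemplar_db)
    print("start score:", score(start))
    print("final score:", score(best))
    print("database size:", len(exemplar_db), "->", len(grown_db))
```

Because improved candidates are written back into the database, later rounds can retrieve exemplars that score higher than anything in the initial set. This is one plausible reading of how an iterative update could let generation move beyond the original retrieval database.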