We address the task of controlled generation of small molecules, which entails finding novel molecules with desired properties under given constraints (e.g., similarity to a reference molecule). We introduce MolMIM, a probabilistic auto-encoder for small-molecule drug discovery that learns an informative and clustered latent space. MolMIM is trained with Mutual Information Machine (MIM) learning and provides a fixed-length representation of variable-length SMILES strings. Since encoder-decoder models can learn representations with ``holes'' of invalid samples, we propose a novel extension to the training procedure that promotes a dense latent space and allows the model to sample valid molecules from random perturbations of latent codes. We provide a thorough comparison of MolMIM to several variable-size and fixed-size encoder-decoder models, demonstrating MolMIM's superior generation as measured by validity, uniqueness, and novelty. We then apply CMA-ES, a simple black-box, gradient-free search algorithm, over MolMIM's latent space for the task of property-guided molecule optimization. We achieve state-of-the-art results in several constrained single-property optimization tasks as well as in the challenging task of multi-objective optimization, improving on the previous state-of-the-art success rate by more than 5\%. Because CMA-ES is typically used only as a baseline optimization method, we attribute the strong results to MolMIM's latent representation, which clusters similar molecules in the latent space. We also demonstrate that MolMIM is favourable in a compute-limited regime, making it an attractive model for such cases.
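To make the idea of constrained, gradient-free search over a latent space concrete, the following is a minimal sketch under stated assumptions. It is not the method used with MolMIM: it substitutes a simplified evolution strategy (fixed step size, no covariance adaptation) for CMA-ES, and `toy_property` and `toy_similarity` are hypothetical stand-ins for scoring a decoded molecule and for its similarity to the reference molecule. In a real setting, each candidate latent code would first be decoded to a SMILES string and then scored.

```python
import math
import random


def toy_property(z):
    """Hypothetical property score of a latent code (higher is better)."""
    target = [1.0] * len(z)
    return -sum((a - b) ** 2 for a, b in zip(z, target))


def toy_similarity(z, z_ref):
    """Hypothetical similarity to the reference code, in (0, 1]."""
    return math.exp(-math.dist(z, z_ref))


def latent_search(z_ref, min_sim=0.4, pop=20, elite=5,
                  sigma=0.3, steps=50, seed=0):
    """Simplified evolution strategy (NOT full CMA-ES).

    Samples perturbations around a running mean, discards candidates
    that violate the similarity constraint, and moves the mean to the
    average of the best feasible candidates.
    """
    rng = random.Random(seed)
    mean = list(z_ref)
    best_z, best_score = list(z_ref), toy_property(z_ref)
    for _ in range(steps):
        candidates = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                      for _ in range(pop)]
        # Keep only candidates that satisfy the similarity constraint.
        feasible = [z for z in candidates
                    if toy_similarity(z, z_ref) >= min_sim]
        if not feasible:
            sigma *= 0.5  # shrink the search radius and retry
            continue
        feasible.sort(key=toy_property, reverse=True)
        top = feasible[:elite]
        # Recombine: the new mean is the average of the elite candidates.
        mean = [sum(col) / len(top) for col in zip(*top)]
        top_score = toy_property(top[0])
        if top_score > best_score:
            best_z, best_score = top[0], top_score
    return best_z, best_score


if __name__ == "__main__":
    z_ref = [0.0, 0.0, 0.0, 0.0]
    z_best, score = latent_search(z_ref)
    print("start score:", toy_property(z_ref))
    print("best score:", score)
    print("similarity:", toy_similarity(z_best, z_ref))
```

The search only queries the score and the constraint, never their gradients, which is why a clustered latent space matters: nearby latent codes need to decode to similar, valid molecules for small perturbations to be useful.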