We address the task of controlled generation of small molecules, which entails finding novel molecules with desired properties under certain constraints (e.g., similarity to a reference molecule). Here we introduce MolMIM, a probabilistic auto-encoder for small-molecule drug discovery that learns an informative and clustered latent space. MolMIM is trained with Mutual Information Machine (MIM) learning and provides a fixed-length representation of variable-length SMILES strings. Since encoder-decoder models can learn representations with ``holes'' of invalid samples, we propose a novel extension to the training procedure which promotes a dense latent space and allows the model to sample valid molecules from random perturbations of latent codes. We provide a thorough comparison of MolMIM to several variable-size and fixed-size encoder-decoder models, demonstrating MolMIM's superior generation as measured by validity, uniqueness, and novelty. We then utilize CMA-ES, a naive, black-box, gradient-free search algorithm, over MolMIM's latent space for the task of property-guided molecule optimization. We achieve state-of-the-art results in several constrained single-property optimization tasks as well as in the challenging task of multi-objective optimization, improving over the previous success-rate SOTA by more than 5\%. We attribute these strong results to MolMIM's latent representation, which clusters similar molecules together, since CMA-ES itself is often used merely as a baseline optimization method. We also demonstrate that MolMIM is favourable in a compute-limited regime, making it an attractive model for such cases.
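To make the latent-space search described above concrete, the following is a minimal sketch of CMA-ES-driven, property-guided optimization over an encoder-decoder latent space. It is not the MolMIM implementation: \texttt{encode}, \texttt{decode}, and \texttt{property\_score} are hypothetical stand-ins for a MolMIM-like encoder/decoder and a scoring oracle (e.g., QED), and the sketch assumes the open-source \texttt{cma} package for its \texttt{CMAEvolutionStrategy} interface.
\begin{verbatim}
# Sketch only: encode/decode/property_score are hypothetical placeholders,
# not the actual MolMIM API. Requires the `cma` package (pip install cma).
import numpy as np
import cma

def encode(smiles: str) -> np.ndarray:
    """Hypothetical: map a SMILES string to a fixed-length latent code."""
    raise NotImplementedError

def decode(z: np.ndarray) -> str:
    """Hypothetical: map a latent code back to a SMILES string."""
    raise NotImplementedError

def property_score(smiles: str) -> float:
    """Hypothetical oracle, e.g. QED; higher is better."""
    raise NotImplementedError

def optimize(seed_smiles: str, sigma0: float = 1.0, iterations: int = 50) -> str:
    # Start the search from the latent code of a reference molecule.
    z0 = encode(seed_smiles)
    es = cma.CMAEvolutionStrategy(z0, sigma0)
    best_smiles, best_score = seed_smiles, property_score(seed_smiles)
    for _ in range(iterations):
        candidates = es.ask()                      # sample latent perturbations
        smiles_batch = [decode(np.asarray(z)) for z in candidates]
        scores = [property_score(s) for s in smiles_batch]
        es.tell(candidates, [-s for s in scores])  # CMA-ES minimizes, so negate
        i = int(np.argmax(scores))
        if scores[i] > best_score:
            best_smiles, best_score = smiles_batch[i], scores[i]
    return best_smiles
\end{verbatim}
A constrained variant of this loop would additionally reject decoded candidates whose similarity to the reference molecule falls below a threshold before scoring them.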