Variational autoencoders (VAEs) are a popular method for drug discovery, and many architectures and pipelines have been proposed to improve their performance. However, the VAE model itself suffers from deficiencies, such as poor manifold recovery when the data lie on a low-dimensional manifold embedded in a higher-dimensional ambient space, and these deficiencies manifest differently in each application. Their consequences in drug discovery are somewhat under-explored. In this paper, we study how to improve the similarity between the data generated by a VAE and the training dataset by improving manifold recovery via a 2-stage VAE, where the second-stage VAE is trained on the latent space of the first. We experimentally evaluate our approach on the ChEMBL dataset as well as a polymer dataset. On both datasets, the 2-stage VAE method significantly improves the property statistics over a pre-existing method.
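One way to make the 2-stage construction concrete is sketched below. It follows the standard two-stage VAE formulation and is an assumption, not notation taken from this paper: the symbols $x$ (a molecule or polymer representation), $z$ (first-stage latent), $u$ (second-stage latent), and the parameter names $\theta, \phi, \theta', \phi'$ are introduced here for illustration only.

```latex
% Stage 1 (assumed notation): a standard VAE on data x with latent z
\mathcal{L}_1(\theta,\phi)
  = \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big]
  - \mathrm{KL}\big(q_\phi(z\mid x)\,\|\,\mathcal{N}(0,I)\big)

% Stage 2 (assumed notation): a second VAE trained on samples z ~ q_\phi(z|x),
% with its own latent u of the same dimension as z
\mathcal{L}_2(\theta',\phi')
  = \mathbb{E}_{q_{\phi'}(u\mid z)}\big[\log p_{\theta'}(z\mid u)\big]
  - \mathrm{KL}\big(q_{\phi'}(u\mid z)\,\|\,\mathcal{N}(0,I)\big)

% Generation: ancestral sampling through both decoders
u \sim \mathcal{N}(0,I), \qquad z \sim p_{\theta'}(z\mid u), \qquad x \sim p_\theta(x\mid z)
```

Under this reading, new samples are drawn by passing standard normal noise through the second-stage decoder and then the first-stage decoder, rather than sampling $z$ directly from $\mathcal{N}(0,I)$.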