Machine learning methods have recently been used to propose molecules with desired properties, which is especially useful for exploring large chemical spaces efficiently. However, these methods rely on fully labelled training data and are impractical when molecules must satisfy multiple property constraints: publicly available databases often lack sufficient training data covering all of those properties, especially when ab initio simulation or experimental property data are also desired for training the conditional molecular generative model. In this work, we show how to modify a semi-supervised variational auto-encoder (SSVAE) model, which only works with fully labelled and fully unlabelled molecular property training data, into the ConGen model, which also works on training data with sparsely populated labels. We evaluate ConGen's performance in generating molecules with multiple constraints when trained on a dataset combined from multiple publicly available molecule property databases, and demonstrate an example application: building the virtual chemical space of potential lithium-ion battery localized high-concentration electrolyte (LHCE) diluents.
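To make "sparsely populated labels" concrete, the sketch below shows one generic way a property-prediction loss can skip missing entries when only some properties are labelled for each molecule. It is an illustration under stated assumptions, not the ConGen implementation: the function name `masked_mse`, the use of `math.nan` to mark a missing label, and the toy values are all hypothetical.

```python
import math

def masked_mse(predicted, target):
    """Mean squared error over labelled entries only.

    Assumed convention: `target` holds math.nan where a property label
    is missing. Returns 0.0 when no entry is labelled.
    """
    total, count = 0.0, 0
    for pred_row, tgt_row in zip(predicted, target):
        for p, t in zip(pred_row, tgt_row):
            if math.isnan(t):
                continue  # unlabelled entry contributes no loss
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0

# Three molecules, two hypothetical properties; nan marks a missing label.
targets = [[1.0, math.nan], [math.nan, 2.0], [math.nan, math.nan]]
preds = [[1.5, 9.0], [7.0, 2.0], [3.0, 4.0]]
loss = masked_mse(preds, targets)
```

In this toy case the first two molecules are partially labelled and the third is fully unlabelled. Only the two labelled entries enter the average, so predictions for missing properties are neither rewarded nor penalised.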