Generative models in molecular design tend to be richly parameterized, data-hungry neural models, as they must create complex structured objects as outputs. Estimating such models is challenging when sufficient training data are lacking. In this paper, we propose a surprisingly effective self-training approach for iteratively creating additional molecular targets. We first pre-train the generative model together with a simple property predictor. The property predictor is then used as a likelihood model for filtering candidate structures sampled from the generative model. Additional targets are iteratively produced and used in the course of stochastic EM iterations to maximize the log-likelihood that the candidate structures are accepted. A simple rejection (re-weighting) sampler suffices to draw posterior samples, since the generative model is already reasonable after pre-training. We demonstrate significant gains over strong baselines for both unconditional and conditional molecular design. In particular, our approach outperforms the previous state of the art in conditional molecular design by over 10% absolute. Finally, we show that our approach is useful in other domains as well, such as program synthesis.
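The abstract outlines a complete training loop: pre-train the generator and property predictor, filter sampled candidates with the predictor, and refit the generator on the accepted targets over stochastic EM rounds. Below is a minimal Python sketch of that loop; the `model.sample`, `model.fit`, and `predictor.score` interfaces, and all hyperparameter values, are hypothetical stand-ins for illustration, not the paper's actual API.

```python
def iterative_target_augmentation(model, predictor, train_pairs,
                                  n_rounds=4, n_candidates=20, threshold=0.5):
    """Self-training loop sketched from the abstract (stochastic-EM style).

    Assumed (hypothetical) interfaces, not the paper's actual API:
      model.sample(x, k)    -> k candidate output structures for input x
      model.fit(pairs)      -> one training pass over (input, target) pairs
      predictor.score(x, y) -> estimated probability that structure y
                               satisfies the property constraints of input x
    """
    for _ in range(n_rounds):
        new_targets = []
        for x, _ in train_pairs:
            # Approximate E-step: draw candidates from the generator and
            # keep only those the property predictor accepts -- a simple
            # rejection filter over model samples.
            accepted = [y for y in model.sample(x, n_candidates)
                        if predictor.score(x, y) >= threshold]
            new_targets.extend((x, y) for y in accepted)
        # M-step: refit the generator on the real data plus the accepted
        # synthetic targets, maximizing likelihood over the augmented set.
        model.fit(list(train_pairs) + new_targets)
    return model
```

The rejection filter plays the role of the posterior sampler described in the abstract: because the pre-trained generator is already reasonable, simply discarding candidates the predictor rejects suffices, with no need for a more elaborate sampling scheme in this sketch.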