The generation of molecules with desired properties has gained great popularity, disruptively changing the way scientists design molecular structures and providing support for chemical and materials design. However, despite promising results, previous machine-learning-based deep generative models suffer from a reliance on complex, task-specific fine-tuning, limited-dimensional latent spaces, or the quality of expert rules. In this work, we propose MolGen, a pre-trained molecular language model that effectively learns and shares knowledge across multiple generation tasks and domains. Specifically, we pre-train MolGen using the chemical language SELFIES on more than 100 million unlabelled molecules. We further propose multi-task molecular prefix tuning across several molecular generation tasks and different molecular domains (synthetic and natural products) with a self-feedback mechanism. Extensive experiments show that MolGen achieves superior performance on well-known molecular generation benchmark datasets. Further analysis illustrates that MolGen can accurately capture the distribution of molecules, implicitly learn their structural characteristics, and efficiently explore the chemical space with the guidance of multi-task molecular prefix tuning. Code, datasets, and the pre-trained model will be available at https://github.com/zjunlp/MolGen.
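For readers unfamiliar with SELFIES, the minimal sketch below (using the open-source `selfies` Python package, not MolGen's own tokenizer) illustrates the property that motivates its use as a chemical language for generative models: every SELFIES token string decodes to a syntactically valid molecule, and strings split cleanly into discrete symbols suitable for a language model's vocabulary.

```python
# A minimal sketch, assuming the open-source `selfies` package
# (pip install selfies); this is illustrative, not MolGen's pipeline.
import selfies as sf

smiles = "C1=CC=CC=C1"           # benzene, in SMILES notation
encoded = sf.encoder(smiles)     # SMILES -> SELFIES
decoded = sf.decoder(encoded)    # SELFIES -> SMILES (round trip)

print(encoded)   # e.g. [C][=C][C][=C][C][=C][Ring1][=Branch1]
print(decoded)   # an equivalent SMILES for benzene

# SELFIES splits into bracketed symbols, giving a natural token
# vocabulary for pre-training a molecular language model.
tokens = list(sf.split_selfies(encoded))
print(tokens)
```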