Artificial intelligence is increasingly adopted in drug discovery. However, existing work mainly applies machine learning to the chemical structures of molecules and ignores the vast textual knowledge available in chemistry. Incorporating textual knowledge enables us to realize new drug design objectives, adapt to text-based instructions, and predict complex biological activities. We present MoleculeSTM, a multi-modal molecule structure-text model that jointly learns molecules' chemical structures and textual descriptions via a contrastive learning strategy. To train MoleculeSTM, we construct PubChemSTM, the largest multi-modal dataset to date, with over 280K chemical structure-text pairs. To demonstrate the effectiveness and utility of MoleculeSTM, we design two challenging zero-shot tasks based on text instructions: structure-text retrieval and molecule editing. MoleculeSTM has two main properties: open vocabulary and compositionality via natural language. In experiments, MoleculeSTM achieves state-of-the-art generalization to novel biochemical concepts across various benchmarks.
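To make the contrastive idea concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over paired structure and text embeddings, followed by zero-shot retrieval by cosine similarity. The abstract only states that a contrastive learning strategy is used; the exact loss form, the `temperature` value, the function names, and the toy 2-D embeddings are all assumptions made for illustration, not the authors' implementation.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes v is not all zeros).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def diag_cross_entropy(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(struct_embs, text_embs, temperature=0.1):
    # Hypothetical symmetric InfoNCE loss: the i-th structure and the
    # i-th text form a positive pair; all other pairings are negatives.
    s = [l2_normalize(v) for v in struct_embs]
    t = [l2_normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
        for si in s
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-structure view
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits_t))

def retrieve(query_emb, candidate_embs):
    # Zero-shot retrieval: index of the candidate with highest cosine similarity.
    q = l2_normalize(query_emb)
    sims = [sum(a * b for a, b in zip(q, l2_normalize(c))) for c in candidate_embs]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy embeddings standing in for encoder outputs (illustrative only).
structures = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = contrastive_loss(structures, aligned_texts)
loss_shuffled = contrastive_loss(structures, shuffled_texts)
best_match = retrieve([1.0, 0.1], shuffled_texts)
```

In this toy setup the loss is lower when each structure is paired with its matching text than when the pairs are shuffled, which is the signal that pulls the two modalities into a shared embedding space. Retrieval then reduces to a nearest-neighbor lookup in that space.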