Molecular representation learning plays an essential role in cheminformatics. Recently, language model-based approaches have gained popularity as an alternative to traditional expert-designed features for encoding molecules. However, these approaches use only a single modality to represent molecules. Motivated by the fact that a given molecule can be described through different modalities, such as the Simplified Molecular-Input Line-Entry System (SMILES), International Union of Pure and Applied Chemistry (IUPAC) nomenclature, and the IUPAC International Chemical Identifier (InChI), we propose a multimodal molecular embedding generation approach called MM-Deacon (multimodal molecular domain embedding analysis via contrastive learning). MM-Deacon is trained using SMILES and IUPAC molecule representations as two different modalities. First, SMILES and IUPAC strings are encoded independently by two different transformer-based language models. A contrastive loss is then used to pull the encoded representations from the two modalities closer together when they belong to the same molecule and to push them farther apart when they belong to different molecules. We evaluate the robustness of our molecule embeddings on molecule clustering, cross-modal molecule search, drug similarity assessment, and drug-drug interaction tasks.
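The contrastive objective described above can be illustrated with a minimal sketch. The abstract does not state the exact loss, so this assumes a symmetric InfoNCE-style loss over cosine similarities, which is one common choice for cross-modal contrastive training. The function names, the temperature of 0.07, and the toy 2-dimensional embeddings are all hypothetical and stand in for the outputs of the two transformer encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(smiles_emb, iupac_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumption, not the paper's stated loss).

    Row i of each input is assumed to embed the same molecule, so matching
    pairs lie on the diagonal of the similarity matrix.
    """
    s = [l2_normalize(v) for v in smiles_emb]
    u = [l2_normalize(v) for v in iupac_emb]
    # logits[i][j]: scaled cosine similarity of SMILES i and IUPAC j
    logits = [[sum(a * b for a, b in zip(si, uj)) / temperature for uj in u]
              for si in s]
    logits_t = [list(col) for col in zip(*logits)]  # IUPAC-to-SMILES direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy embeddings (hypothetical values) for two molecules.
smiles_emb = [[1.0, 0.0], [0.0, 1.0]]
iupac_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each row matches its SMILES row
iupac_shuffled = [[0.1, 0.9], [0.9, 0.1]]  # rows paired with the wrong molecule

loss_aligned = contrastive_loss(smiles_emb, iupac_aligned)
loss_shuffled = contrastive_loss(smiles_emb, iupac_shuffled)
print(loss_aligned < loss_shuffled)
```

In this sketch the loss is low when each SMILES embedding is most similar to the IUPAC embedding of the same molecule and high when the pairs are mismatched, which is the behavior the training objective rewards.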