Molecular representation learning contributes to multiple downstream tasks such as molecular property prediction and drug design. To properly represent molecules, graph contrastive learning is a promising paradigm, as it utilizes self-supervision signals and requires no human annotations. However, prior works fail to incorporate fundamental domain knowledge into graph semantics and thus ignore the correlations between atoms that share common attributes but are not directly connected by bonds. To address these issues, we construct a Chemical Element Knowledge Graph (KG) to summarize microscopic associations between elements and propose a novel Knowledge-enhanced Contrastive Learning (KCL) framework for molecular representation learning. The KCL framework consists of three modules. The first module, knowledge-guided graph augmentation, augments the original molecular graph based on the Chemical Element KG. The second module, knowledge-aware graph representation, extracts molecular representations with a common graph encoder for the original molecular graph and a Knowledge-aware Message Passing Neural Network (KMPNN) to encode complex information in the augmented molecular graph. The final module is a contrastive objective, where we maximize agreement between these two views of molecular graphs. Extensive experiments demonstrate that KCL outperforms state-of-the-art baselines on eight molecular datasets. Visualization experiments interpret what KCL has learned from atoms and attributes in the augmented molecular graphs. Our code and data are available at https://github.com/ZJU-Fangyin/KCL.
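To make the first and third modules concrete, the following is a minimal sketch, not the authors' implementation (which is in the repository linked above). It rests on several assumptions: `ELEMENT_KG` is a hypothetical three-entry slice of an element knowledge graph that uses only the fact that C, N, and O lie in period 2; `augment_graph`, `cosine`, and `contrastive_loss` are illustrative names; the loss is a generic InfoNCE-style objective, since the abstract does not state the exact form; and the graph encoder and KMPNN are omitted, so the embeddings are simply given as lists of floats.

```python
import math

# Hypothetical toy slice of an element KG: element -> (relation, attribute) pairs.
ELEMENT_KG = {
    "C": [("inPeriod", "Period2")],
    "N": [("inPeriod", "Period2")],
    "O": [("inPeriod", "Period2")],
}


def augment_graph(atoms, bonds, kg=ELEMENT_KG):
    """Add one node per distinct KG attribute and link it to every matching atom.

    atoms: list of element symbols; bonds: list of (i, j) atom-index pairs.
    Returns (nodes, edges), where edges are (src, dst, label) triples.
    """
    nodes = list(atoms)
    edges = [(i, j, "bond") for i, j in bonds]
    attr_index = {}  # attribute name -> node index in the augmented graph
    for i, symbol in enumerate(atoms):
        for relation, attribute in kg.get(symbol, []):
            if attribute not in attr_index:
                attr_index[attribute] = len(nodes)
                nodes.append(attribute)
            edges.append((attr_index[attribute], i, relation))
    return nodes, edges


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(orig_views, aug_views, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    Embedding i of the original view is pulled toward embedding i of the
    augmented view and pushed away from the other molecules in the batch.
    """
    total = 0.0
    for i, z in enumerate(orig_views):
        logits = [cosine(z, w) / temperature for w in aug_views]
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        total += log_denominator - logits[i]
    return total / len(orig_views)


# Heavy atoms of ethanol: C-C-O. Atom 0 (C) and atom 2 (O) share no bond,
# but both become neighbours of the added "Period2" attribute node.
nodes, edges = augment_graph(["C", "C", "O"], [(0, 1), (1, 2)])

# Matching views give a lower loss than views paired with the wrong molecule.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch the attribute node is what lets information flow between atoms that share a property but no bond, which is the gap in prior work that the abstract points out. Minimizing the loss corresponds to maximizing agreement between the two views of each molecule.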