Self-supervised learning (SSL) learns data representations by exploiting supervision signals inherent in the data itself. It has drawn attention in the drug domain, where annotated data are scarce because experiments are time-consuming and expensive. SSL on very large unlabeled datasets has performed well on molecular property prediction, but three issues remain. (1) Existing SSL models are large, which makes SSL hard to apply where computing resources are limited. (2) Most current models use 3D structural information only partially or not at all for molecular representation learning, even though the activity of a drug is closely related to the structure of the drug molecule. (3) Previous models that apply contrastive learning to molecules build augmented views by permuting atoms and bonds, so molecules with different characteristics can end up in the same set of positive samples. To address these problems, we propose small-scale 3D Graph Contrastive Learning (3DGCL), a novel contrastive learning framework for molecular property prediction. 3DGCL learns molecular representations that reflect the molecule's structure through a pre-training process that does not change the semantics of the drug. Using only 1,128 samples of pre-training data and 1 million model parameters, we achieve state-of-the-art or comparable performance on four regression benchmark datasets. Extensive experiments demonstrate that 3D structural information based on chemical knowledge is essential to molecular representation learning for property prediction.
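To make the role of positive samples concrete, the following is a minimal sketch of a generic contrastive objective (NT-Xent, as commonly used in contrastive learning). It is not the 3DGCL loss: the abstract does not specify the objective or how views are generated. The function names `cosine` and `nt_xent_loss`, the temperature of 0.5, and the toy 2-dimensional embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(views_a, views_b, temperature=0.5):
    """Illustrative NT-Xent loss over a batch of paired embeddings.

    views_a[i] and views_b[i] are two embeddings of molecule i (a positive
    pair); every other embedding in the batch acts as a negative.
    How the two views are produced is left open here (assumption).
    """
    embeddings = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same molecule
        # Temperature-scaled similarity of anchor i to every other embedding.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(2 * n)
            if j != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits.values())
        log_denominator = top + math.log(
            sum(math.exp(s - top) for s in logits.values())
        )
        total += log_denominator - logits[positive]
    return total / (2 * n)


# Toy embeddings (hypothetical): two molecules, two views each.
view_a = [[1.0, 0.0], [0.0, 1.0]]
consistent_view_b = [[0.9, 0.1], [0.1, 0.9]]  # each view stays close to its own molecule
mismatched_view_b = [[0.1, 0.9], [0.9, 0.1]]  # each view resembles the other molecule

print(nt_xent_loss(view_a, consistent_view_b))
print(nt_xent_loss(view_a, mismatched_view_b))
```

In this toy setup the loss is lower when each positive pair really describes the same molecule than when the pairing is mismatched. This mirrors the concern in issue (3): if an augmentation changes a molecule's semantics, the objective pulls together representations that should stay apart.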