Molecular machine learning (ML) holds promise for efficient molecular property prediction and drug discovery. However, labeled molecule data can be expensive and time-consuming to acquire, and with limited labels it is difficult for supervised ML models to generalize across the vast chemical space. In this work, we present MolCLR: Molecular Contrastive Learning of Representations via Graph Neural Networks (GNNs), a self-supervised learning framework that leverages a large corpus of unlabeled data (~10M unique molecules). In MolCLR pre-training, we build molecule graphs and develop GNN encoders to learn differentiable representations. Three molecule graph augmentations are proposed: atom masking, bond deletion, and subgraph removal. A contrastive estimator maximizes the agreement between augmentations of the same molecule while minimizing the agreement between different molecules. Experiments show that our contrastive learning framework significantly improves the performance of GNNs on various molecular property benchmarks, including both classification and regression tasks. Benefiting from pre-training on the large unlabeled database, MolCLR even achieves state-of-the-art results on several challenging benchmarks after fine-tuning. Further investigations demonstrate that MolCLR learns to embed molecules into representations that capture chemically reasonable molecular similarities.
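To make the augmentation and contrastive objective concrete, the sketch below shows one of the three augmentations (atom masking) and a SimCLR-style NT-Xent loss in PyTorch. This is a minimal illustration, not the authors' implementation: the abstract does not name the specific estimator, so NT-Xent is an assumption, and the function names (`atom_mask`, `nt_xent`), the scalar `mask_token`, and the tensor layout are all hypothetical choices for demonstration.

```python
# Minimal sketch (not the MolCLR source code) of atom masking and an
# NT-Xent contrastive loss, assuming a molecule graph is represented by
# a node-feature tensor and encoded into a fixed-size embedding.
import torch
import torch.nn.functional as F

def atom_mask(node_feats: torch.Tensor, mask_rate: float = 0.25,
              mask_token: float = 0.0) -> torch.Tensor:
    """Randomly replace a fraction of atom (node) features with a mask token."""
    num_atoms = node_feats.size(0)
    num_masked = max(1, int(mask_rate * num_atoms))
    idx = torch.randperm(num_atoms)[:num_masked]
    masked = node_feats.clone()
    # Hypothetical scalar mask; a real encoder might use a learned mask embedding.
    masked[idx] = mask_token
    return masked

def nt_xent(z1: torch.Tensor, z2: torch.Tensor,
            temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent: two augmented views of the same molecule are positives;
    every other molecule in the batch serves as a negative."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.t() / temperature                       # scaled cosine similarity
    n = z1.size(0)
    sim.fill_diagonal_(float('-inf'))                   # drop self-similarity
    targets = torch.arange(2 * n, device=z.device)
    targets = (targets + n) % (2 * n)                   # positive = the other view
    return F.cross_entropy(sim, targets)

# Toy usage: embeddings of 4 molecules under two augmented views.
z1, z2 = torch.randn(4, 64), torch.randn(4, 64)
loss = nt_xent(z1, z2)
```

In this setup, minimizing the loss pulls the two augmented views of each molecule together in embedding space while pushing apart embeddings of different molecules, which is the agreement-maximizing behavior the abstract attributes to the contrastive estimator.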