Although substantial efforts have been made using graph neural networks (GNNs) for AI-driven drug discovery (AIDD), effective molecular representation learning remains an open challenge, especially when labeled molecules are scarce. Recent studies suggest that large GNN models pre-trained by self-supervised learning on unlabeled datasets achieve better transfer performance on downstream molecular property prediction tasks. However, they often require large-scale datasets and considerable computational resources, making them time-consuming, computationally expensive, and environmentally unfriendly. To alleviate these limitations, we propose a novel pre-training model for molecular representation learning, the Bi-branch Masked Graph Transformer Autoencoder (BatmanNet). BatmanNet features two tailored and complementary graph autoencoders that reconstruct the missing nodes and edges of a masked molecular graph. Surprisingly, we found that a high masking proportion (60%) of atoms and bonds yields the best performance. We further propose an asymmetric graph-based encoder-decoder architecture for both the node and edge branches, in which a transformer-based encoder operates only on the visible subset of nodes or edges, while a lightweight decoder reconstructs the original molecule from the latent representation and mask tokens. With this simple yet effective asymmetric design, BatmanNet learns efficiently even from a much smaller unlabeled molecular dataset, capturing the underlying structural and semantic information and overcoming a major limitation of current deep neural networks for molecular representation learning. For instance, pre-trained on only 250K unlabeled molecules, our BatmanNet with 2.575M parameters achieves a 0.5% improvement in average AUC over the current state-of-the-art method, which has 100M parameters and was pre-trained on 11M molecules.
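To make the asymmetric masked-autoencoder idea concrete, here is a minimal sketch of one branch (node reconstruction) in PyTorch. All names and hyperparameters are hypothetical illustrations, not the authors' released implementation; for simplicity it treats node features as a plain token sequence and omits edge features and graph connectivity, which BatmanNet's second branch and graph-aware encoder would handle. The key points it shows are the same as in the abstract: the encoder sees only the visible ~40% of nodes, and a much shallower decoder receives the latent codes plus learned mask tokens and is trained to reconstruct the masked nodes.

```python
# Hypothetical sketch of an asymmetric masked node autoencoder.
# Assumptions (not from the paper): feature dims, layer counts, head counts,
# and the use of an MSE reconstruction loss on masked positions.
import torch
import torch.nn as nn


class MaskedNodeAutoencoder(nn.Module):
    def __init__(self, feat_dim=64, enc_dim=128, enc_layers=4,
                 dec_layers=1, mask_ratio=0.6):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(feat_dim, enc_dim)
        # Transformer encoder: runs only on visible nodes.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=4, batch_first=True),
            num_layers=enc_layers)
        # Learned placeholder inserted at every masked position.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, enc_dim))
        # "Lightweight" decoder: far fewer layers than the encoder.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=4, batch_first=True),
            num_layers=dec_layers)
        self.head = nn.Linear(enc_dim, feat_dim)

    def forward(self, x):
        # x: (batch, num_nodes, feat_dim) node features of a molecular graph.
        B, N, D = x.shape
        n_keep = max(1, int(N * (1 - self.mask_ratio)))
        # Random permutation per graph; first n_keep indices stay visible.
        perm = torch.rand(B, N, device=x.device).argsort(dim=1)
        keep, masked = perm[:, :n_keep], perm[:, n_keep:]

        # Encode the visible subset only -- this is the compute saving.
        visible = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        z = self.encoder(self.embed(visible))

        # Rebuild the full sequence: latent codes at visible slots,
        # mask tokens everywhere else.
        full = self.mask_token.expand(B, N, -1).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, z.size(-1)), z)
        recon = self.head(self.decoder(full))

        # Reconstruction loss is computed on masked nodes only.
        target = torch.gather(x, 1, masked.unsqueeze(-1).expand(-1, -1, D))
        pred = torch.gather(recon, 1, masked.unsqueeze(-1).expand(-1, -1, D))
        return nn.functional.mse_loss(pred, target)


# Usage: one self-supervised pre-training step on a batch of 8 graphs
# with 20 nodes each (random features stand in for real molecules).
model = MaskedNodeAutoencoder()
loss = model(torch.randn(8, 20, 64))
loss.backward()
```

The asymmetry is what makes the 60% mask ratio cheap rather than wasteful: the expensive encoder processes only the visible 40% of tokens, while the shallow decoder, which is discarded after pre-training, handles the full-length sequence.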