Extracting informative representations of molecules using graph neural networks (GNNs) is crucial in AI-driven drug discovery. Recently, the graph research community has been trying to replicate the success of self-supervised pretraining in natural language processing, and several successes have been claimed. However, we find that the benefit of self-supervised pretraining on molecular data can be negligible in many cases. We conduct thorough ablation studies on the key components of GNN pretraining that determine downstream accuracy, including pretraining objectives, data splitting methods, input features, pretraining dataset scales, and GNN architectures. Our first important finding is that self-supervised graph pretraining does not show a statistically significant advantage over non-pretraining methods in many settings. Second, although improvement can be observed with additional supervised pretraining, the improvement may diminish with richer features or more balanced data splits. Third, experimental hyperparameters have a larger impact on downstream task accuracy than the choice of pretraining task. We hypothesize that the complexity of pretraining on molecules is insufficient, leading to less transferable knowledge for downstream tasks.
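For context on one of the ablated components, data splitting: molecular benchmarks are often split by Bemis–Murcko scaffold so that structurally similar molecules do not leak across train/valid/test. The sketch below is a minimal illustration of such a split, assuming RDKit is available; the function name `scaffold_split` and its parameters are hypothetical and not taken from the paper's code.

```python
# Minimal sketch of a scaffold-based split, assuming RDKit is installed.
from collections import defaultdict
from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles


def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    """Group molecules by Bemis-Murcko scaffold, then assign whole scaffold
    groups (largest first) to train/valid/test, so structurally similar
    molecules stay on the same side of the split."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        groups[MurckoScaffoldSmiles(smiles=smi)].append(idx)

    # Larger scaffold groups are assigned to the training set first.
    ordered = sorted(groups.values(), key=len, reverse=True)

    n = len(smiles_list)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

A random split, by contrast, shuffles individual molecules, which tends to be more "balanced" and easier; the abstract's second finding notes that gains from supervised pretraining may shrink under such splits.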