In artificial intelligence for science, the limited amount of labeled data available for real-world problems is a persistent challenge. The prevailing approach is to pretrain a powerful task-agnostic model on a large unlabeled corpus, but such models may struggle to transfer knowledge to downstream tasks. In this study, we propose InstructBio, a semi-supervised learning algorithm that takes better advantage of unlabeled examples. It introduces an instructor model that provides confidence scores measuring the reliability of pseudo-labels. These confidence scores then guide the target model to weight different data points differently, avoiding both over-reliance on labeled data and the negative influence of incorrect pseudo-annotations. Comprehensive experiments show that InstructBio substantially improves the generalization ability of molecular models, not only in molecular property prediction but also in activity cliff estimation. Furthermore, our evidence indicates that InstructBio can be combined with cutting-edge pretraining methods and used to establish large-scale, task-specific pseudo-labeled molecular datasets, which reduces predictive errors and shortens the training process. Our work provides strong evidence that semi-supervised learning can be a promising tool for overcoming data scarcity and advancing molecular representation learning.
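The confidence-guided weighting described above can be illustrated with a minimal sketch. This is not the authors' loss function, which the abstract does not specify. It assumes a regression setting, a squared-error loss, a weight of 1 for ground-truth labels, and a confidence in [0, 1] for each pseudo-label. The function name `confidence_weighted_mse` and its arguments are hypothetical.

```python
from typing import Optional, Sequence


def confidence_weighted_mse(
    predictions: Sequence[float],
    targets: Sequence[float],
    confidences: Sequence[Optional[float]],
) -> float:
    """Weighted mean squared error over labeled and pseudo-labeled points.

    confidences[i] is None for a ground-truth label (weight 1.0), or a float
    in [0, 1] from a hypothetical instructor model for a pseudo-label.
    This weighting scheme is an assumption made for illustration.
    """
    if not (len(predictions) == len(targets) == len(confidences)):
        raise ValueError("all inputs must have the same length")

    weighted_error = 0.0
    total_weight = 0.0
    for pred, target, conf in zip(predictions, targets, confidences):
        weight = 1.0 if conf is None else conf
        if not 0.0 <= weight <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        weighted_error += weight * (pred - target) ** 2
        total_weight += weight

    # Guard against the degenerate case where every weight is zero.
    return weighted_error / total_weight if total_weight > 0 else 0.0


# One labeled point followed by two pseudo-labeled points.
loss = confidence_weighted_mse(
    predictions=[1.0, 2.0, 3.0],
    targets=[1.5, 2.0, 5.0],
    confidences=[None, 0.9, 0.1],
)
```

In this example the third point has a large error but a low confidence, so it contributes little to the loss. That is the intended effect of down-weighting unreliable pseudo-labels.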