In artificial intelligence for science, the limited amount of labeled data available for real-world problems is a persistent challenge. The prevailing approach is to pretrain a powerful task-agnostic model on a large unlabeled corpus, but such a model may struggle to transfer its knowledge to downstream tasks. In this study, we propose InstructBio, a semi-supervised learning algorithm that takes better advantage of unlabeled examples. It introduces an instructor model that provides confidence scores as a measure of the pseudo-labels' reliability. These confidence scores then guide the target model to weight different data points differently, avoiding both over-reliance on labeled data and the negative influence of incorrect pseudo-annotations. Comprehensive experiments show that InstructBio substantially improves the generalization ability of molecular models, not only in molecular property prediction but also in activity cliff estimation. Furthermore, our evidence indicates that InstructBio can be combined with cutting-edge pretraining methods and used to build large-scale, task-specific pseudo-labeled molecular datasets, which reduces predictive error and shortens training. Our work provides strong evidence that semi-supervised learning is a promising tool for overcoming data scarcity and advancing molecular representation learning.
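The abstract does not give the training objective, so the following is only a minimal sketch of the general idea it describes: pseudo-labeled examples contribute to the loss in proportion to a confidence score, while labeled examples keep full weight. The function names, the squared-error loss, and the normalization by total weight are all assumptions made for illustration, not the authors' formulation.

```python
from typing import Sequence, Tuple


def squared_error(prediction: float, target: float) -> float:
    """Per-example regression loss (an illustrative choice, not from the paper)."""
    return (prediction - target) ** 2


def confidence_weighted_loss(
    labeled: Sequence[Tuple[float, float]],
    pseudo: Sequence[Tuple[float, float, float]],
) -> float:
    """Hypothetical confidence-weighted objective.

    labeled: (prediction, true_label) pairs, each with weight 1.
    pseudo:  (prediction, pseudo_label, confidence) triples, where the
             confidence in [0, 1] stands in for the instructor's score.
    Returns the weighted mean loss over both sets.
    """
    total = 0.0
    weight_sum = 0.0

    # Labeled examples are fully trusted.
    for prediction, label in labeled:
        total += squared_error(prediction, label)
        weight_sum += 1.0

    # Pseudo-labeled examples are scaled by their confidence, so an
    # unreliable pseudo-label (confidence near 0) barely affects training.
    for prediction, pseudo_label, confidence in pseudo:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        total += confidence * squared_error(prediction, pseudo_label)
        weight_sum += confidence

    if weight_sum == 0.0:
        raise ValueError("no example carries non-zero weight")
    return total / weight_sum


if __name__ == "__main__":
    labeled_batch = [(1.0, 0.0)]       # loss 1.0, weight 1.0
    pseudo_batch = [(2.0, 0.0, 0.5)]   # loss 4.0, weight 0.5
    print(confidence_weighted_loss(labeled_batch, pseudo_batch))  # 2.0
```

In this sketch a pseudo-label with confidence 0 drops out of the objective entirely, which is one simple way to limit the damage from incorrect pseudo-annotations; how the instructor model actually produces and applies its scores is not specified in the abstract.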