InstructBio: A Large-scale Semi-supervised Learning Paradigm for Biochemical Problems

In artificial intelligence for science, the limited amount of labeled data available for real-world problems is a persistent challenge. The prevailing approach is to pretrain a powerful task-agnostic model on a large unlabeled corpus, but such a model may struggle to transfer its knowledge to downstream tasks. In this study, we propose InstructBio, a semi-supervised learning algorithm that takes better advantage of unlabeled examples. It introduces an instructor model that provides confidence scores as a measure of each pseudo-label's reliability. These confidence scores then guide the target model to weight different data points differently, avoiding both over-reliance on labeled data and the negative influence of incorrect pseudo-annotations. Comprehensive experiments show that InstructBio substantially improves the generalization ability of molecular models, not only in molecular property prediction but also in activity cliff estimation. Furthermore, our evidence indicates that InstructBio can be combined with cutting-edge pretraining methods and used to build large-scale, task-specific pseudo-labeled molecular datasets, which reduces predictive error and shortens training. Our work provides strong evidence that semi-supervised learning can be a promising tool for overcoming data scarcity and advancing molecular representation learning.
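The abstract does not give the loss function, so the following is only a minimal sketch of the general idea it describes: pseudo-labeled examples contribute to the training objective in proportion to a confidence score supplied by an instructor model. The function name `confidence_weighted_mse`, the choice of squared error, the weight of 1 for labeled examples, and the normalization by total weight are all assumptions made for illustration, not the authors' formulation.

```python
from typing import Sequence


def confidence_weighted_mse(
    pred_labeled: Sequence[float],
    y_labeled: Sequence[float],
    pred_unlabeled: Sequence[float],
    pseudo_labels: Sequence[float],
    confidence: Sequence[float],
) -> float:
    """Hypothetical confidence-weighted regression loss.

    Labeled examples get weight 1. Each pseudo-labeled example gets a
    weight equal to its confidence score, assumed to lie in [0, 1].
    """
    if len(pred_labeled) != len(y_labeled):
        raise ValueError("labeled predictions and targets differ in length")
    if not (len(pred_unlabeled) == len(pseudo_labels) == len(confidence)):
        raise ValueError("unlabeled inputs differ in length")
    if any(c < 0.0 or c > 1.0 for c in confidence):
        raise ValueError("confidence scores must lie in [0, 1]")

    # Squared error on ground-truth labels, each with weight 1.
    labeled_term = sum((p - y) ** 2 for p, y in zip(pred_labeled, y_labeled))

    # Squared error on pseudo-labels, scaled by the confidence score so that
    # unreliable pseudo-labels contribute little to the objective.
    pseudo_term = sum(
        c * (p - y) ** 2
        for c, p, y in zip(confidence, pred_unlabeled, pseudo_labels)
    )

    total_weight = len(y_labeled) + sum(confidence)
    if total_weight == 0:
        raise ValueError("no examples with non-zero weight")
    return (labeled_term + pseudo_term) / total_weight


loss = confidence_weighted_mse(
    pred_labeled=[1.0, 2.0],
    y_labeled=[1.0, 3.0],
    pred_unlabeled=[0.0, 0.0],
    pseudo_labels=[2.0, 1.0],
    confidence=[0.5, 0.0],
)
```

In this sketch a pseudo-label with confidence 0 is ignored entirely, so when every confidence score is 0 the loss reduces to the ordinary mean squared error on the labeled data.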

