InstructBio: A Large-scale Semi-supervised Learning Paradigm for Biochemical Problems

In artificial intelligence for science, the scarcity of labeled data for real-world problems is a persistent challenge. The prevailing approach is to pretrain a powerful task-agnostic model on a large unlabeled corpus, but such a model may struggle to transfer its knowledge to downstream tasks. In this study, we propose InstructBio, a semi-supervised learning algorithm that takes better advantage of unlabeled examples. It introduces an instructor model that provides confidence scores as a measure of the reliability of pseudo-labels. These confidence scores then guide the target model to pay different amounts of attention to different data points, avoiding both over-reliance on labeled data and the negative influence of incorrect pseudo-annotations. Comprehensive experiments show that InstructBio substantially improves the generalization ability of molecular models, not only in molecular property prediction but also in activity cliff estimation. Furthermore, our evidence indicates that InstructBio can be combined with cutting-edge pretraining methods and used to build large-scale, task-specific pseudo-labeled molecular datasets, which reduces predictive error and shortens training. Our work provides strong evidence that semi-supervised learning can be a promising tool for overcoming data scarcity and advancing molecular representation learning.
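The abstract says that an instructor model scores how reliable each pseudo-label is and that the target model then weights data points accordingly. One way to picture this is a confidence-weighted loss. The sketch below is only an illustration under stated assumptions: the function name `confidence_weighted_mse`, the use of squared error, and the convention that ground-truth labels get a confidence of 1.0 are all hypothetical and are not taken from the paper, whose actual objective is not given in the abstract.

```python
from typing import Sequence


def confidence_weighted_mse(
    predictions: Sequence[float],
    targets: Sequence[float],
    confidences: Sequence[float],
) -> float:
    """Squared error averaged with per-example confidence weights.

    Hypothetical illustration: `confidences` stands in for the scores an
    instructor model would assign (1.0 for trusted ground-truth labels,
    lower values for doubtful pseudo-labels). This weighting scheme is an
    assumption, not the loss defined in the paper.
    """
    if not (len(predictions) == len(targets) == len(confidences)):
        raise ValueError("all inputs must have the same length")
    total_weight = sum(confidences)
    if total_weight <= 0:
        raise ValueError("at least one confidence must be positive")
    # Each squared error is scaled by how much its label is trusted.
    weighted_error = sum(
        c * (p - t) ** 2 for p, t, c in zip(predictions, targets, confidences)
    )
    # Normalize by the total weight so the result stays on the MSE scale.
    return weighted_error / total_weight


# Two labeled examples (confidence 1.0) and one pseudo-labeled example whose
# label the hypothetical instructor only half trusts.
loss = confidence_weighted_mse(
    predictions=[1.0, 2.0, 3.0],
    targets=[1.0, 2.0, 5.0],
    confidences=[1.0, 1.0, 0.5],
)
```

With this weighting, a pseudo-label with confidence 0.0 contributes nothing to the loss, while a fully trusted pseudo-label counts as much as a ground-truth label. This is one simple reading of "paying different amounts of attention to different data points".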


